

RELIABILITY  OF  THE  1971  CENSUS  DATA 


June  1972 


Prepared  by 

Central  Statistical  Services 
with  the 
assistance  of 
Statistics  Canada 






Ministry of Treasury, Economics and Intergovernmental Affairs
Parliament Buildings, Toronto, Ontario




INDEX


Errors Associated with Censuses - Shashi Sharma

Evaluation and Measurement of Errors - Gordon Brackstone

Improved Census Methodology and its Impact on Quality of Data - Shashi Sharma

Sampling and its Impact on Reliability of Census Data - Gordon Brackstone

Confidentiality and its Effect on Census Data - Dr. Mike Murphy


ERRORS  ASSOCIATED  WITH  CENSUSES 


Shashi  Sharma 



Census  data  constitute  the  most  important  single  source  of 
demographic  and  socio-economic  information,  basic  to  a  wide 
range  of  requirements  for  policy,  analytical,  administrative, 
and  other  purposes.  For  this  reason,  the  utmost  care  and 
effort  is  expended  in  the  design  and  implementation  of  the 
decennial  census.  Yet  most  users,  however  expert  in  the 
analysis  and  interpretation  of  statistics,  know  little  or 
nothing  of  the  ways  and  means  of  collecting,  processing, 
evaluating  and  presenting  the  data  which  they  use.  The  purpose 
of  this  seminar  is  to  outline  some  of  the  key  operational 
aspects  of  the  1971  Census  and  in  particular  to  explain  the 
role and nature of error in census data so that the user may
better understand the nature and quality of the information
he will be analysing.

As  part  of  the  evaluation  programme  of  the  1971  Census,  a  detailed 
analysis  of  some  of  the  errors  in  the  census  results  is  being 
carried  out.  Census  figures  are  published  in  many  forms,  with 
a  minority  based  on  a  complete  census  and  the  vast  majority  based 
on  a  sample  of  households. 

Each  figure  obtained  in  the  census,  whether  it  be  based  on  a 
sample or on complete coverage, is subject to error. These errors,
sampling or non-sampling, have to be estimated in order to
establish  the  degree  of  reliability  to  be  placed  on  the  census 
figures.  Whereas  sampling  errors  measure  the  effect  of  obtaining 
an  estimate  from  a  sample  of  households  rather  than  from  all 
households,  non-sampling  errors  are  made  up  of  such  things  as 
coverage  errors,  response  errors  and  processing  errors. 

NON-SAMPLING  ERRORS 

1. Coverage errors

Although  the  intention  is  to  include  everyone  in  the  census  count, 
this  is  never  fully  achieved,  no  matter  what  the  method  of 
enumeration  may  be.  There  is  no  complete  list  of  persons  available 
before  the  census  which  could  be  used  as  a  control  device  to 
ensure  that  everyone  is  enumerated.  As  a  result,  one  must  rely 
on  enumerators  who  are  instructed  to  include  in  the  census  every 
dwelling  in  the  country  as  well  as  every  person  living  in  the 
dwelling  (both  private  and  other  types  of  dwellings).  It  is 
evident  that  bias  may  result  from  non-response  due  to  the  fact 
that  particular  groups  in  the  population  are  more  likely  to  be 
"missed"  than  others.  This  is  illustrated  by  reference  to  past 
census  results. 

In  1961  it  was  estimated  that  2.5  to  3  per  cent  of  the  total 
population  was  not  enumerated  in  the  census.  While  this  total 
is far from satisfactory, the situation was in fact somewhat
worse than these figures may indicate. In fact, males 15-25
years  of  age  (a  critical  age  group  for  many  labour  force  and 
education-oriented  questions)  were  missed  at  a  much  higher 
rate:  probably  more  than  10  per  cent.  Besides  age  and  sex,  the 
rate  of  missed  persons  also  varied  by  geographical  areas: 
indications  are  that  congested  downtown  areas  and  people  in  the 
lower  levels  of  socio-economic  strata  were  under-enumerated  by 
substantially  more  than  the  national  average.  While  it  is 
impossible to say what the effect of such an under-enumeration
might  have  been  on  the  published  statistics,  it  is  certain  that 
it  substantially  biased  the  resulting  tabulations,  affecting 
both  their  level  and  composition. 

Coverage  errors  are  unevenly  distributed  with  respect  to  both 
geography and population characteristics; it is therefore certain
that small-area statistics and detailed cross-tabulations are
more distorted by them than are larger aggregates. For this reason,
distributions  and  patterns  may  also  be  somewhat  distorted  by 
coverage  errors. 

2.  Response  errors 

This  type  of  error  occurs  when  data  are  requested,  provided, 
received,  and  recorded.  Questions  may  be  misinterpreted  by  the 
respondent; he or she may not know, may not remember, or may
purposely want to distort the correct answer. Response errors
can be broken down into two components: response bias and
response variance. The basic difference is that the former
does not depend on the cell frequency, while the latter does.
From the point of view of small-area statistics or fine breakdowns
involving few observations, the response variance is the
more relevant term.

(a)  Response  variance.  Response  variance  is  that  component  of 
the  total  response  error  which  tends  to  cancel  out  over  a  large 
number  of  responses.  Some  enumerators  may  tend  to  inflate  a 
count,  while  others  tend  to  deflate  it;  some  respondents  tend 
to  overstate  their  response  to  a  certain  question,  others  tend 
to  understate  it;  some  questions  on  the  questionnaire  may  be 
ambiguous (or difficult to communicate) and may invite misunderstandings
in either direction. All these causes (and many
others)  contribute  to  response  variance.  While  the  net  error 
from  this  source  may  be  quite  small  for  large  areas  (because 
positive  and  negative  errors  partly  cancel  out  as  the  population 
is  aggregated),  it  can  be  very  large  for  small  areas  or  rare 
characteristics.  It  also  tends  to  be  much  larger  for  sensitive 
characteristics which are normally difficult to measure (e.g.
labour  force,  certain  occupations,  education,  income,  etc.). 


Response  variance  was  measured  for  a  large  number  of 
characteristics  as  part  of  the  evaluation  of  the  1961  Census. 

A few points need to be emphasized in connection with the response
variance:

(i) An overwhelming proportion of the response variance to which
the 1961 Census statistics were subject was
due  to  the  fact  that  the  census  was  taken  by  a  direct  enumerator 
canvass  method.  This  meant  that  the  country  was  divided  into 
small  areas  of  150  to  200  households  (called  Enumeration  Areas 
or  EAs),  and  each  of  these  was  assigned  to  an  enumerator.  The 
enumerator  collected  the  census  information  by  knocking  on  doors 
and  talking  to  any  responsible  member  of  the  household  who 
provided  information  for  the  whole  household  (typically,  this 
person  was  the  housewife).  It  has  been  observed  and  measured 
that each enumerator has personal biases which contribute significantly
to the error of the statistics he collects. These
enumerator  biases  tend  to  cancel  out  over  a  large  number  of 
enumerators.  However,  small-area  statistics  and/or  detailed 
cross-classifications  are  based  on  the  work  of  a  few  enumerators 
and  in  this  case  the  chances  of  possibly  large  errors  not 
cancelling  out  are  substantial.  The  situation  is  analogous  to 
making  estimates  on  the  basis  of  extremely  small  samples  (one  or 
two observations). Hence, the impact of enumerators on small-area
and/or detailed cross-classification statistics was extremely large
in 1961.


(ii) In interpreting the previous paragraph, one needs to keep
in  mind  that  the  census  enumerators  are  not  regular  Statistics 
Canada  employees  with  years  of  experience  behind  them.  On  the 
contrary,  in  1961,  over  thirty  thousand  enumerators  were  hired, 
trained,  supervised  in  their  enumeration  of  some  eighteen  million 
persons,  and  disbanded,  all  within  a  matter  of  one  to  two  months. 

(iii) For the reasons outlined in (i), the response variance
dominated the small estimates, i.e., in 1961 it was by far the
largest source of error of the small-area or detailed
cross-classification data.

(b)  Response  bias.  Response  bias  is,  roughly  speaking,  that 
portion  of  the  total  response  error  which  is  left  after  all  the 
cancellation  described  above  has  occurred.  It  is  made  up  of 
response  errors  which  have  a  tendency  to  occur  in  one  direction 
more  than  in  the  other.  It  may  be  the  result  of  the  training 
and  general  attitude  of  enumerators,  the  reluctance  of  respondents 
to admit certain characteristics, or a tendency to overstate or understate
certain types of activity. It may be a fault of the
questionnaire which may invite errors in one direction more than
in the other. Such errors do not cancel out even over large
areas or large counts and may be particularly damaging for statistics
at the metropolitan, provincial, or national level. They also
have a tendency to be much larger for characteristics which are
difficult to measure.


In  order  to  derive  an  accurate  measure  of  response  bias  one  would 
have  to  compare  the  actual  census  results  with  "perfect"  results. 

In  addition,  the  comparison  would  have  to  be  made  for  large 
populations  since  otherwise  one  would  primarily  be  measuring 
differences  caused  by  response  variance  rather  than  by  response 
bias. However, "perfect" results are impossible to procure, and
we thus have to work with the best estimate of response bias
available. Such a measure has been derived as a result of
empirical studies involving re-enumeration of interviewed households,
both in the United States and in Canada. Such studies
indicate  that  the  overall  response  bias  is  of  the  order  of  6 
per  cent. 

Empirical  evidence  suggested  that  a  significant  proportion  of 
response  bias  may  stem  from  enumerator  intervention.  For  this  and 
other  reasons  self-enumeration  was  utilized  wherever  possible,  in 
the  1971  Census  operation. 

The average level of response bias is approximately 6 per
cent. This average is too low for sensitive,
difficult-to-measure characteristics (such as certain occupation and
other labour force characteristics, income, education), and so once
again, for these characteristics the errors not due to sampling
could be understated.


3.  Time  lag  bias 

This results from delays in making published data available.
Because both demographic and socio-economic changes occur, the
data will not reflect the actual situation at the time of publication.
Basic tabulations were produced from the 1961 Census as
late as five years after the census date. There is little doubt that
when census data are used for important decision-making purposes,
the lack of timeliness of the statistics is a source of decision-making
error. When, on the basis of census data, an urban renewal
area is delineated, a poverty area is designated, the location of
a retail outlet is decided, or some other decision is taken, it
is important that the data should reflect the real world as it is
(or at least as it was in the recent past), rather than as it was
four to five years ago. To the extent that the census data do not
reflect the real world at the time of their publication, users of
the data have to take into account problems of timeliness whose
magnitude increases with the passage of time.

To study the effect of the passage of time on census statistics, a
comparison was made for a sample of identical areas of the total
number  of  persons,  the  number  of  persons  in  two  age  groups  (0-4 
years,  20-24  years),  the  number  attending  school,  and  the  number 
of  wage-earning  family  heads.  The  sampled  areas  included  rural  as 
well  as  urban  areas.  The  following  highlights  (some  of  them  quite 
obvious)  emerged  from  the  study: 


(i)  The  larger  the  delay  in  publishing  the  data,  the  less  they 
reflect  the  real  world.  While  the  size  and  direction  of  the 
change  with  the  passage  of  time  varies  from  area  to  area  and 
characteristic  to  characteristic,  the  tendency  is  unmistakable. 

(ii)  The  larger  the  area,  the  more  chance  there  is  for  offsetting 
and  partially  cancelling  changes  over  time.  Hence,  the  passage 
of  time  affects  large  area  statistics  to  a  lesser  extent. 

(iii)  Straight  demographic  characteristics  are  less  affected  than 
the  more  sensitive  measures  related  to  education  or  labour  force. 

(iv)  Urban  areas  are  more  affected  than  rural  ones. 

The actual impact of the timelag between the publication of the
census data and the reference period can be called the timelag
bias. It is equal to the difference between the statistics at the
time of the census and the time of publication, implicitly assuming
that  users  of  census  data  apply  them  as  though  they  were  current 
data.  Depending  on  the  variable  considered,  in  small  areas  it 
may  reach  as  high  as  20  per  cent  per  year. 


4.  Processing  errors 

Error  is  also  accumulated  in  the  course  of  processing  (coding, 
document  reading,  imputation  for  edit  rejects  and  non-respondents, 
tabulations).  Moreover,  these  errors  affect  a  full  census  count 
more  than  a  census  whose  more  difficult  questions  are  collected 
on  a  sample  basis. 

The  sheer  volume  associated  with  a  full-count  census  diminishes 
the  possibility  of  tight  supervision,  careful  selection  of  staff, 
and  quality  control.  It  is  estimated  that  with  a  full-count  census, 
approximately seventy-two million questionnaire pages would have
to  be  coded,  and  would  require  about  2,550  coding  clerks  and 
supervisors  for  a  period  of  four  months.  There  is  a  likelihood 
that  available  physical  facilities  would  necessitate  a  two-shift 
operation.  Given  the  fact  that  coding  (particularly  of  industry, 
occupation,  and  geographic  data)  is  a  complex  and  demanding  task, 
there  is  every  probability  that  a  significant  loss  in  quality  would 
occur in comparison with what could be achieved on a sample basis.
The  magnitude  of  processing  errors  was  not  measured  in  the  1961 
Census;  hence,  no  estimate  is  available  at  present. 

SAMPLING  ERRORS 

Sampling  error  measures  the  effect  of  basing  an  estimate  on  the 
results  obtained  from  a  sample  of  persons  rather  than  a  full  count. 


The  size  of  the  sampling  error  depends  on  the  sample  size,  the 
sample  design,  and  the  method  used  to  estimate  the  totals.  It 
has  been  noted  that  the  introduction  of  sampling  may  reduce  the 
magnitude  of  some  of  the  non-sampling  errors.  This  is  primarily 
because  sampling  deals  with  smaller  numbers.  Consequently,  more 
rigorous  training,  more  checks,  and  tests  with  respect  to  coverage 
and  content,  can  be  afforded  at  the  enumeration  stage.  As  a 
smaller  staff  is  required  for  the  field  work,  coding,  and 
processing  of  data,  the  result  is  a  better  control  of  the  operations. 

In  spite  of  the  opportunity  for  reduction  in  the  non-sampling  errors 
which  sampling  affords,  the  total  error  for  certain  areas  and 
characteristics  will  be  quite  large  since  estimates  for  small 
areas or small populations resulting from detailed cross-tabulations
will  be  subject  to  both  large  sampling  and  non-sampling  errors. 

In sum, the number of items in a cell is an important
factor in determining the relative reliability of data. Thus,
with respect to the issue of sampling in the census, the important
question is the relative contribution of sampling errors compared
with other sources of error.

MATHEMATICAL  MODEL  FOR  MEAN  SQUARE  ERROR 

To  facilitate  the  discussion  regarding  Evaluation  of  Errors,  it  is 
necessary  to  introduce  a  technical  term:  The  Root  Mean  Square 
Error (RMSE). It is the positive square root of the mean square error,
which  consists  of  the  following  components: 


i) Sampling variance, when data are collected on a sample
basis;

ii) Response variance;

iii) Square of the response bias;

iv) Square of the time-lag bias, when the time-lag bias is considered.

In  the  case  of  a  full  count  census  the  sampling  error  is  equal 
to  zero.  The  relative  magnitude  of  the  root  mean  square  error 
measures  the  accuracy  of  sample  data  as  compared  with  full  count 
data. When cost is considered together with the RMSE, a measure
of  efficiency  of  estimates  is  provided. 

The RMSE is a statistical measure with the property that an
estimate plus or minus twice its RMSE represents an interval which
would contain the true (but unknown) number with approximately 95%
probability. The coefficient of variation (CV) expresses the
RMSE as a percentage of the estimate to which it refers (i.e. whose
'error' it measures). Both these measures, RMSE and CV, will be used
extensively in the measurement of errors.
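
The relationship between the four components, the RMSE, the approximate 95% interval and the CV can be illustrated with a short calculation. This is a minimal sketch only: the function names and the numerical values in the usage note are hypothetical and are not taken from the 1971 Census evaluation.

```python
import math


def rmse(sampling_var=0.0, response_var=0.0, response_bias=0.0, time_lag_bias=0.0):
    """Root mean square error built from the four components listed above.

    The two variances enter directly; the two biases enter as squares.
    For a full-count census, leave sampling_var at zero.
    """
    mse = sampling_var + response_var + response_bias ** 2 + time_lag_bias ** 2
    return math.sqrt(mse)


def approx_95_interval(estimate, rmse_value):
    """Estimate plus or minus twice its RMSE."""
    return (estimate - 2 * rmse_value, estimate + 2 * rmse_value)


def coefficient_of_variation(estimate, rmse_value):
    """RMSE expressed as a percentage of the estimate it refers to."""
    return 100.0 * rmse_value / estimate
```

For example, with an illustrative estimate of 1,000, a sampling variance of 900 and a response variance of 1,600 (both biases taken as zero), `rmse(900, 1600)` gives 50, the interval is 900 to 1,100, and the CV is 5 per cent. Setting the sampling variance to zero, as for a full count, gives an RMSE of 40.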


EVALUATION  AND  MEASUREMENT  OF  ERRORS 


Gordon  Brackstone 


1971  CENSUS:  EVALUATION  AND  MEASUREMENT  OF  ERRORS 

1.  INTRODUCTION 

There  are  two  basic  questions  one  would  wish  to  ask  about  any  major  survey: 

(i)  How  reliable  is  the  resulting  data? 

(ii)  How  efficiently  was  it  collected? 

Answering  these  two  questions  about  the  1971  Census  is  the  main  function  of  the 
1971 Census Evaluation Programme.

The  measurement  of  the  quality  of  the  resulting  data  is  an  essential  part  of  any 
survey. It is particularly important in the case of a national Census, whose results
are  so  widely  used.  If  the  Census  is  to  be  used  to  provide  data  on  which  to  base 
planning  decisions,  or  to  provide  a  base  for  future  projections,  it  is  important 
to  know  just  how  reliable  this  data  really  is.  It  is  also  important  for  Census 
personnel to be aware of the magnitude of errors in Census data, and how they
entered,  so  that  remedial  action  can  be  taken  for  future  Censuses.  Thus  the  first 
objective  of  the  Evaluation  Programme  can  be  stated: 

To provide information on the size and types of errors occurring in Census
data so that
(a) users can be informed of the reliability and limitations
of Census figures, and
(b) Census personnel can be made aware of those aspects of the
Census operation in which improvements need to be made.

Of  equal  importance  to  Census  personnel  is  the  need  to  evaluate  the  performance 
and  efficiency  of  the  various  individual  Census  operations,  not  only  in  terms 
of  their  effect  on  the  quality  of  the  final  data,  but  also  in  terms  of  their  costs, 
timeliness, and overall efficiency. Thus the second objective of the Evaluation
Programme  is: 


To provide information on the performance and efficiency of the various
Census operations so that an accurate picture of the 1971 Census experience
can form the basis for planning future Censuses.

The Evaluation Programme consists of over 50 individual projects, each aimed
at  a  particular  aspect  of  the  1971  Census.  The  total  budget  for  the  Programme 
exceeds  $1  million,  two-thirds  of  which  is  absorbed  by  the  six  major  projects: 

1.  Reverse  Record  Check  -  to  measure  undercoverage 

2.  Response  Variance  -  to  measure  overall  reliability 

3.  Agricultural  Quality  Check 

4.  Evaluation  of  Quality  Control  procedures 

5. Work Measurement Programme

6.  Official  Observer  Programme 

The  remainder  of  this  paper  deals  only  with  evaluation  projects  aimed  at 
the  first  objective,  the  measurement  of  errors.  The  next  section  covers 
the  measurement  of  coverage  errors;  the  third  section  describes  the  response 
variance  project  which  measures  the  overall  reliability  of  Census  data;  and 
the  fourth  section  covers  the  measurement  and  control  of  processing  errors. 

2. THE MEASUREMENT OF COVERAGE ERRORS

A  coverage  error  occurs  when  a  person  or  household  that  should  have  been 
enumerated  in  the  Census  is  either  not  enumerated  or  enumerated  more  than 
once.  Coverage  errors  can  therefore  lead  to  errors  in  total  population 
counts  at  all  geographical  levels.  Furthermore,  if  persons  or  households  with 
certain characteristics are more often missed (or double counted) in the
Census than others, the distribution of the population according to these
characteristics may be affected (e.g. if young adult males are more often
missed than other age-sex groups, the age-sex distribution of the population
obtained from the Census will be distorted). When interpreting Census data
it is thus important to have some idea of the extent and incidence of
coverage errors found in the Census. In a national Census, undercoverage
appears to be more frequent than overcoverage. Undercoverage will
occur wherever a Census representative overlooks or, for any other reason,
fails  to  enumerate  a  dwelling,  or  when  a  householder  omits  certain  members 
of  the  household  when  completing  the  questionnaire.  Overcoverage  can  only 
occur  when  a  person  is  enumerated  as  a  permanent  resident  in  two  different 
households.  It  is  quite  common  for  a  person  to  be  enumerated  as  a  permanent 
resident  in  one  dwelling  and  as  a  temporary  resident  in  another  dwelling 
(e.g.  in  a  hotel).  A  system  of  cross-checking  on  a  sample  basis  is  built 
into  Census  procedures  to  ensure  that  adjustments  are  made  for  this  potential 
source  of  double-counting. 

There  are  several  methods  of  estimating  coverage  errors  in  a  Census: 

1) The first method is re-enumeration of a random sample of areas soon
after  the  Census  using  highly-trained  permanent  staff  to  ensure  a  complete 
count.  This  method  has  two  main  disadvantages: 

a)  however  hard  one  tries,  one  can  never  be  sure  of  having  achieved  complete 
coverage 

b)  real  changes  between  Census  day  and  re-enumeration  will  confound  the 
comparison  of  the  two  enumerations. 

2) The second method is the analytic demographic method. This uses
demographic techniques to estimate the current population by age and sex.
Such estimates are based on previous Census figures and data on births, deaths,
immigrants and emigrants. The weaknesses of this method are:

a) there are no reliable data on emigration from Canada


b) no data on the characteristics of individual persons missed in the
current Census can be obtained (other than age-sex)

3)  The  third  method  is  known  as  the  Reverse  Record  Check  and,  since  it 
is the method that has been used in Canada in 1961, 1966, and 1971, it will
be described in some detail. In essence it involves:

(1)  building  up,  from  sources  independent  of  the  1971  Census,  a 
universe  of  persons  who  should  have  been  enumerated  in  the 
1971  Census 

(2)  selecting  a  sample  from  this  universe 

(3) determining the address of each selected person on Census
day  1971 

(4)  checking  with  Census  returns  at  the  address  found  to  see  whether 
the  person  was  enumerated 

(5)  collecting  further  information  from  those  persons  found  to  have 
been  missed  in  the  Census 

(6)  scaling  up  the  sample  results  to  give  estimates  of  undercoverage 
and  the  characteristics  of  missed  persons. 

The  universe  is  made  up  of  persons  enumerated  in  the  1966  Census,  birth 
registrations  since  the  1966  Census,  immigrant  registrations  since  the  1966 
Census, and persons missed in the 1966 Census. Theoretically these four
groups  include  everyone  who  should  have  been  enumerated  in  the  1971  Census. 
(They  will  also  include  some  persons  who  should  not  be  enumerated  in  the 
1971  Census  i.e.  deaths  and  emigrants  before  the  1971  Census).  Random 
samples  were  selected  from  each  of  the  first  three  groups.  A  sample  of  persons 
missed in the 1966 Census was available from the 1966 Reverse Record Check.
The total sample size was 27,000 persons. The third stage, determining the
1971 address for each selected person, is the most difficult. About 45%
of persons selected from the 1966 Census were found immediately at the same
address. For the others, a variety of tracing methods, including mail-outs,
telephone calls, searches of administrative records, and field follow-up
by Regional Offices, has so far located 95% of the original sample. This
operation is still in progress. Once a person is traced, the Census returns
are  checked  to  see  if  he  was  enumerated.  If  so,  our  task  is  over.  If  not, 
attempts  are  made  to  confirm  the  address  to  which  the  person  was  traced.  If 
the  address  is  confirmed  the  person  is  classed  as  missed  and  he  is  asked 
to  complete  a  questionnaire  containing  questions  similar  to  those  asked  in 
the  Census.  In  this  way  it  is  hoped  to  collect  and  analyse  the  characteristics 
of  persons  missed  in  the  Census.  As  a  result  of  the  tracing  and  searching 
operations,  every  member  of  the  original  sample  will  be  classified  as  either 
'enumerated',  'missed',  'deceased',  'emigrated',  or  'tracing  failed'  (this 
last group containing persons for whom all tracing attempts were unsuccessful).
Based on this classification, estimates of undercoverage by age, sex,
region,  and  certain  other  characteristics  will  be  made. 
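
The scaling-up step can be pictured with a simple weighted calculation over the five classes. This is a minimal sketch and not the estimator actually used in the Reverse Record Check, which the text does not specify. It assumes, hypothetically, that each sampled person carries a sampling weight, that persons classed as 'deceased' or 'emigrated' fall outside the Census population, and that 'tracing failed' cases are simply dropped (which amounts to assuming they were missed at the same rate as traced persons). The function name and record layout are invented for illustration.

```python
from collections import Counter


def undercoverage_rate(records):
    """Per cent undercoverage from (status, weight) pairs.

    Assumption: only 'enumerated' and 'missed' persons belong to the
    population that should have been counted; 'deceased', 'emigrated'
    and 'tracing failed' records are ignored.
    """
    totals = Counter()
    for status, weight in records:
        totals[status] += weight
    in_scope = totals["enumerated"] + totals["missed"]
    if in_scope == 0:
        raise ValueError("no enumerated or missed persons in the sample")
    return 100.0 * totals["missed"] / in_scope
```

Applying the same function to the records of a single age-sex group or region would give estimates by characteristic, subject to the same assumptions.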

Since  the  tracing  is  still  in  progress,  no  results  for  1971  are  yet 
available.  Instead,  some  results  from  1966  are  quoted  to  exemplify  the  type 
of  information  that  will  eventually  be  available  for  the  1971  Census. 


TABLE 1. % Underenumeration by Age and Sex, 1966 Census of Canada

                  % Underenumeration
Age Group       Total     Male     Female

0-4              2.96     2.30      3.63
5-14             1.53     1.44      1.63
15-19            3.57     3.77      3.35
20-24            7.76     9.79      5.60
25-39            2.73     3.23      2.22
40-59            1.67     2.27      1.09
60+              1.97     1.72      2.37

All Ages         2.62     2.90      2.35


TABLE 2. % Underenumeration by Region, 1966 Census of Canada

Region               % Underenumeration

Atlantic                   1.98
Quebec                     2.95
Ontario                    2.65
Prairies                   2.24
British Columbia           2.84

CANADA                     2.62


Results  of  the  1971  Reverse  Record  Check  are  scheduled  to  be  available 
by the end of this year. These results strictly measure only undercoverage;
they do not allow for any double-counting that might have occurred.

There  is  a  small  evaluation  project  that  is  examining  the  coverage  of  persons 
who  moved  within  two  weeks  of  Census  day.  A  sample  of  such  'movers'  has 
been  selected  and  Census  returns  will  be  checked  to  see  whether  each 
'mover'  was  enumerated  at  just  one  address,  at  both  his  old  and  new  address, 
or at neither address. Thus this project will provide some data on
under-coverage and over-coverage for this particular sub-group of the population.

3. THE MEASUREMENT OF RESPONSE ERROR

This section describes the response variance project which is designed to
provide  measures  of  the  root  mean  square  error  of  published  Census  data. 

The estimates of RMSE made from this study will include sampling variance
(for long-form characteristics), response variance introduced at the time
of enumeration, coding variance, and processing error variance. The
figures for RMSE will not include any allowance for undercoverage, nor
will they include a measure of any response bias that may be present in
the data.

Since  response  variance  measures  the  level  of  variation  between  response 
errors,  a  method  for  measuring  it  can  be  built  into  the  Census  using 
certain  statistical  techniques.  However,  the  measurement  of  response 
bias is more difficult since it requires knowledge of "true values"
and so some independent source of data outside the Census is needed. We
will return to the measurement of response bias after the description of
the  response  variance  project. 


Response  Variance 

The  idea  behind  the  response  variance  project  is  to  obtain  two 
independent  estimates  of  the  same  Census  cell  frequency,  and,  by  comparison 
of  these  two  estimates,  to  obtain  an  estimate  of  the  total  variance. 

A random sample of 400 pairs of contiguous enumeration areas (EAs) was
selected  prior  to  the  1971  Census.  A  Census  representative  (or  enumerator) 
was  assigned  to  each  EA  and  she  performed  the  drop-off  procedures  in  the 
normal  way.  (Drop-off  involved  listing  all  households  in  the  EA  and 
leaving  a  Census  questionnaire  with  each).  After  drop-off  the  list  of 
households  in  each  EA  was  split  into  two  random  halves. The  new  enumerator 
assignments  then  consisted  of  one  half  of  the  enumerator's  original 
assignment together with one half of the other enumerator's original EA,
and  vice  versa.  Each  enumerator  then  proceeded  to  enumerate  her  new 
assignment  in  the  usual  way.  Thus  each  enumerator's  assignment  was 
roughly  the  same  size  (in  terms  of  households)  as  a  regular  EA  but  covered 
twice  the  area. 
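
The reassignment step described above can be sketched in a few lines. This is only an illustration of the idea of splitting each household list into random halves and exchanging one half; the function name, the use of a seeded random number generator and the household labels are assumptions, not part of the Census procedures.

```python
import random


def interpenetrate(households_a, households_b, seed=0):
    """Split two EA household lists into random halves and exchange one half.

    Returns the two new enumerator assignments. Names and the seeding
    scheme are illustrative assumptions.
    """
    rng = random.Random(seed)
    a = list(households_a)
    b = list(households_b)
    rng.shuffle(a)  # random order, so each slice is a random half
    rng.shuffle(b)
    half_a = len(a) // 2
    half_b = len(b) // 2
    # Enumerator 1: one half of her own EA plus one half of the other EA.
    assignment_1 = a[:half_a] + b[:half_b]
    # Enumerator 2: the remaining halves of both EAs.
    assignment_2 = a[half_a:] + b[half_b:]
    return assignment_1, assignment_2


# Hypothetical household lists for a pair of contiguous EAs.
ea_a = [f"A{i}" for i in range(150)]
ea_b = [f"B{i}" for i in range(140)]
first, second = interpenetrate(ea_a, ea_b)
```

Each new assignment is about the size of one regular EA but draws households from both areas, which is what allows two independent estimates to be compared for the same pair of EAs.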

Using  a  mathematical  model  for  the  response  errors,  estimates  of  RMSE 
will be made for various characteristics and groups of characteristics.

These  estimates  will  then  be  used  to  draw  up  graphs  or  tables  plotting 
the RMSE against the cell size to which they relate, for various families
of characteristics. The graphs or tables will then be used to estimate
the RMSE for other sizes of Census figures. In this way it will be
possible to read off an RMSE estimate for any Census figure of any size.

The provisional groups of characteristics for which separate tables are
to  be  produced  are  as  follows: 


Demographic  &  Social  Characteristics: 


1.  Basic  short-form  demographic  data  (age,  sex,  marital  status, 
mother  tongue) 

2.  Social  characteristics  (religion,  language,  citizenship, 
immigration,  etc.) 

3.  Fertility 

4.  Migration 

5.  Education 

Economic  Characteristics: 


6. Labour Force Status

7. Occupation

8. Industry

9. Weeks worked in 1970

10. Total income

11. Employment income

Housing  Characteristics: 

12.  Short-form  housing  data  (tenure,  type,  number  of  rooms,  etc.) 

13. Dwelling characteristics (e.g. age, bedrooms, fuels, household
facilities)

14. Values and Rents

15. Dwelling characteristics cross-tabulations




When the RMSE figures become available these groupings may be
revised so that only characteristics with similar levels of reliability
are  grouped  together. 

A  typical  table  is  shown  below: 


Occupation Characteristics

Size of Census Estimate   1 million   500,000   50,000   10,000   1,000   500   100   50   20   10

RMSE

The  figures  for  RMSE  in  these  tables  will  not  take  account  of  undercoverage, 
response  bias,  or  time-lag  bias. 

The  above  tables  are  to  be  published  as  part  of  the  Census  publications 
to  which  they  relate. 

Response  Bias 

As  mentioned  earlier,  the  measurement  of  response  bias  requires  knowledge 
of  a  true  value.  For  certain  characteristics  and  certain  sub-groups  of 
the  population  it  is  possible  to  obtain  "true  values".  Some  evaluation 
projects  are  making  use  of  such  true  values  in  an  attempt  to  estimate 
response  bias.  As  an  example,  a  sample  of  births  was  selected  from 
birth  registrations  in  the  past  five  years  and  these  children  will  be 
located  in  Census  returns  to  check  their  age-reporting. 

Another  check  is  being  made  on  the  reporting  of  date  of  birth,  country 
of  birth  and  ethnic  origin  by  recent  immigrants.  Further  checks  are 
also being made against records from other current surveys (e.g. the
monthly Labour Force Survey for May 1971). The incidence of different
responses by the same person in the LFS and the Census is being
measured.  Although,  in  this  case,  we  cannot  be  sure  which  response 
is  the  correct  one,  this  study  can  alert  us  to  those  questions  where 
response  errors  are  more  prevalent. 

In summary, estimates of the variance component of RMSE will be
available  from  the  response  variance  project.  While  certain  limited 
investigations  of  sources  and  sizes  of  response  biases  are  being  made, 
no  overall  measures  of  these  biases  will  be  available.  However,  in 
designing  the  procedures  for  the  1971  Census  every  effort  was  made  to 
eliminate possible sources of bias (e.g. in the wording of questions,
in the writing of coding instructions, and in the use of quality control
procedures).




4. CONTROL OF PROCESSING ERRORS

1. Introduction

The earlier presentations have covered errors associated with the census
and the evaluation and measurement of errors. The latter, in fact, takes
account of the success (it should not be otherwise) of the statistical quality
control procedures applied to the collecting and processing of the Census
data.

Applications of statistical quality control to such Census procedures have
the  objective  of  limiting  quality  loss  due  to  processing. 

The  1971  Census  was  the  first  Canadian  Census  to  incorporate  a  statistical 
quality  control  program,  which,  apart  from  the  cost  of  any  corrective  action 
entailed  by  the  program,  cost  approximately  one  million  dollars.  The  procedures 
(with  other  census  procedures)  were  tested  in  a  Trial  Census  carried  out  in  1969 
and  certain  modifications  and  improvements  were  made  to  them  as  a  result. 

The 1971 quality control program consisted of procedures applied to:

1.  Printing  of  census  questionnaires. 

2.  The  preparation  at  Head  Office  of  various  census  kits  for  distribution  by  the 
Regional  Offices. 

3. The enumeration assignment (i.e. completed questionnaires) turned in by Census
representatives to Census Commissioners.

4. The office coding of written responses to questions carried out in the
Regional Processing Offices.

5. The microfilming of completed questionnaires and the transcription of the
information on microfilm to magnetic tape by reading it on FOSDIC (Film
Optical Sensing Device for Input to Computers).




Basically,  a  fixed  budget  was  allowed  for  each  quality  control  application  and 
for  any  re-processing  that  proved  to  be  necessary  following  quality  control 
rejection.  In  each  case,  the  fixed  budget  necessarily  determined  the  quality 
that  could  be  achieved.  All  the  plans  were  based  on  attributes  apart  from  the 
application  to  microfilming  which  was  based  on  variables.  In  the  remainder  of 
the  paper,  process  and  product  control  will  be  briefly  reviewed  and  some 
details  of  a  few  of  the  five  applications  will  be  given. 

2. Quality Control Approaches: Process and Product Control

Three  approaches  can  be  applied  to  control  processing  of  survey  or  census 
data:

1.  Process  Control. 

2.  Product  Control  (or  Acceptance  Sampling) 

3.  Procedures  combining  both. 

In  process  control,  the  sample  design  is  established  to  measure  process 
quality,  and  decision  rules  for  adjusting  the  process  are  invoked  only  when 
there  is  evidence  that  the  process  is  out  of  statistical  control. 

EXAMPLE 1: The output of a production process is a continuous series
of items and the most important characteristic of each
item can be described by a single measurement, such as
length, strength, etc. If the production process is
operating correctly, the measurements on the items are
approximately normally distributed with a certain mean
and variance. A sample of five items is drawn from the
process every hour and measurements made on each item.
From the results, it is required to decide whether the
process is operating correctly (the term "in control" is
used) or whether some kind of corrective action needs to
be taken.
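
A decision rule of the kind Example 1 calls for can be sketched as follows. The sketch assumes the conventional control-chart rule of flagging a sample whose mean falls more than three standard errors from the target; the text does not state which rule was used, and the target mean, standard deviation and measurements below are hypothetical.

```python
from math import sqrt
from statistics import mean


def in_control(sample, target_mean, sigma, k=3.0):
    """Return True if the sample mean lies within k standard errors of the target.

    k = 3 is the conventional choice for control limits and is an
    assumption here, not a figure taken from the text.
    """
    standard_error = sigma / sqrt(len(sample))
    lower = target_mean - k * standard_error
    upper = target_mean + k * standard_error
    return lower <= mean(sample) <= upper


# Hypothetical hourly samples of five measurements each.
ok = in_control([10.1, 9.8, 10.0, 10.3, 9.9], target_mean=10.0, sigma=0.2)
shifted = in_control([10.6, 10.5, 10.7, 10.4, 10.6], target_mean=10.0, sigma=0.2)
```

With these figures the first sample stays inside the limits and the second, whose mean has drifted upward, signals that corrective action should be considered.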


Such a process control technique was used during the microfilming of 1971
Census questionnaires. The control variable measured was the density of the
microfilm, with the mean set for optimum readability with the FOSDIC machine.

Product  Control  or  Acceptance  Sampling  establishes  the  sample  design  and 
decision  rules  for  determining  which  work  lots  (batches  or  assignments)  of 
items  are  acceptable  and  which  are  unacceptable.  Any  lots  rejected  may  be 
scrapped,  sold  as  inferior  products,  or  completely  sorted  and  defective  items 
rectified  or  replaced. 

EXAMPLE 2: Enumerators hand in their completed assignments. A
criterion has been established to identify whether a
questionnaire is defective (incomplete) or effective
(complete). The plan (based on attributes) was to select
n questionnaires from an EA at random and reject the EA
if the number of defective questionnaires found in the
sample was greater than some integer, c, called the
acceptance number. Specifically the EA is returned for
further edit and follow-up.
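
The attributes plan in Example 2 can be illustrated with a short sketch. It applies the accept/reject rule for a sample of n with acceptance number c and computes the chance that an EA with a given number of defective questionnaires passes, assuming simple random sampling without replacement (a hypergeometric model). The function names and the EA size of 150 used in the example are assumptions for illustration only.

```python
from math import comb


def accept_lot(defectives_in_sample, c):
    """Accept the EA when the sample shows at most c defective questionnaires."""
    return defectives_in_sample <= c


def prob_accept(lot_size, defectives_in_lot, n, c):
    """Probability that a random sample of n contains at most c defectives.

    Hypergeometric model: sampling without replacement from a lot of
    lot_size items of which defectives_in_lot are defective.
    """
    total = comb(lot_size, n)
    good = lot_size - defectives_in_lot
    p = 0.0
    for d in range(min(c, n) + 1):
        # comb returns 0 when more items are requested than exist.
        p += comb(defectives_in_lot, d) * comb(good, n - d) / total
    return p


# Hypothetical EA of 150 questionnaires, sample of 12, acceptance number 2.
p_few = prob_accept(150, 10, 12, 2)
p_many = prob_accept(150, 30, 12, 2)
```

Evaluating `prob_accept` over a range of defective counts traces out the operating characteristic of a plan, which is one way to compare alternative choices of n and c.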

The  common  features  of  the  above  (and  other  possible)  examples  are  that 
procedures  are  required  by  which  we  can  decide  among  a  small  number  of  possible 
courses  of  action,  and  in  each  case,  the  decision  is  based  on  the  result  of  a 
small  sample  of  items  and  not  normally  on  inspection  of  every  item. 

The  techniques  of  process  control  and  product  control  can  be  combined  into 
one  procedure  and  this  was  so  in  1971  Census  applications  (e.g.  Q.C.  of  Coding). 




3. Application of Quality Control to Printing of Census Questionnaires

The  printing  of  Census  questionnaires  was  contracted  out  to  commercial 
printers  with  the  Queen's  Printer  (now  Printing  Operations,  Department  of 
Supply  and  Services)  and  associated  quality  control  staff  (in  Material 
Analysis  and  Control  Division)  responsible  for  ensuring  that  paper,  printing 
and  bindery  specifications  were  met.  However,  because  of  lack  of  experience 
in  printing  documents  designed  to  be  processed  on  automatic  page-turning 
equipment  where  filming  occurred  with  the  resulting  microfilm  to  be  read 
satisfactorily  by  FOSDIC,  it  was  necessary  to  determine  whether  the  finished 
product  did  indeed  meet  the  processing  requirements. 

Consequently  an  acceptance  sampling  plan  was  implemented  by  Statistics 
Canada  on  boxes  of  questionnaires  received  from  Commercial  Printers.  Before 
the  boxes  of  finished  questionnaires  were  released  for  stuffing  (putting  them 
in  envelopes)  or  other  operations,  acceptability  tests  were  carried  out.  These 
involved  taking  samples  of  the  questionnaires  delivered  by  the  contractors  and 
subjecting  them  to  FOSDIC  tests. 

It  was  realized  that  even  if  the  tests  showed  that  the  questionnaires  were 
not readable (and we hoped that these might be isolated circumstances), the
contractor  would  not  be  obliged  to  replace  them.  However,  if  we  could  detect 
such  boxes  and  exclude  them  from  their  use  in  the  field,  we  might  save  ourselves 
some  problems  in  the  processing. 

Resources  and  the  time  delays  were  once  again  the  major  problems  in  the 
acceptability  testing.  At  one  point  it  became  evident  that  the  stuffing  operation 
would  have  to  be  stopped,  leaving  the  machines  and  personnel  idle,  if  the  results 
of  the  tests  were  not  available  in  time.  Fortunately,  this  situation  developed 
towards the end of the shipping operation. In such circumstances the questionnaires
were put into envelopes in anticipation of results and were not released before
results  were  known. 

4. Application of Quality Control to Enumeration Assignments (EA's) turned in
by Census Representatives to Census Commissioners

Quality control was limited in scope: for each EA it applied only to those
questionnaires that had been worked on by the Census Representative (CR) and handed
in to the Census Commissioner when the CR had completed his assignment. Questionnaires
that were missing from the assignment, because of refusals by householders
or other reasons, were not in scope.

There were basically two sets of procedures for population and housing
questions:

1) Quality Control procedures for regular EA's covering Forms 2A & 2B.

2) Quality Control procedures for EA's designated as institutional and
collective EA's covering Forms 2A, 2B and Forms 3 (Individual Census
Questionnaire).

There were also Agriculture Quality Control procedures for any EA containing
agriculture holdings for which Forms 6 were completed or for which there were
Specified Farm Cards (Form 6A) and a list of Specified Farms (Form 6B). The EA
was rejected for Agriculture follow-up if (i) for a selection of questions on the
Form 6 there was not at least an entry for each given question or there was no
corresponding entry in the VR (unless it was a non-resident operation for which another
Census representative had completed a questionnaire and which was indicated in the
"Comments" section of the Form 6) or if (ii) the Specified Farm Card was missing
or incomplete or if Form 6 had not been completed when there was a Specified Farm
Card.




4.1 Detail of Procedures

The procedures for (1) above will be outlined here. The average EA might have 150
questionnaires of which one third (50) would be long questionnaires (Forms 2B). A sample
of 12 Forms 2B was selected by an ROR Technician from each EA and, for private households
only, other Forms 2B relating to that household were included in the sample. For each
selected household a Form 71 was completed by the ROR Technician, inserting check marks (/)
for simple omissions (where no explanation was provided) and where special requirements were
not met for questions relating mainly to coverage, for which the Census Representative should
have taken appropriate action. From the attached Form 71 (Appendix 1) you will notice that
some questions are shaded, as these were designated vital questions for which answers were
required. You will also note that the Form is divided into two parts, one covering
100 per cent questions and the other sample questions. A Type 1 reject occurred if there
were one or more check marks in the shaded "100 per cent questions". A Type 2 reject
occurred if either there were one or more check marks in the shaded "sample questions" or
there were 6 or more check marks in non-shaded columns of the form.

To decide on acceptance or rejection the following two
totals were required:

(a)  Total  number  of  households  with  Type  1  rejections. 

(b)  Total  of  households  with  Type  1  or  Type  2  rejections. 

All  Forms  2B  in  the  EA  were  rejected  for  follow-up  if  the  total  (a)  or  (b)  exceeded 
the  acceptance  number  2.  Similarly  all  Forms  2A  in  the  EA  were  to  be  rejected  for  follow-up 
if  the  total  (a)  exceeded  the  acceptance  number  2. 

If the EA was rejected it went back to the original Census Representative for clean-up
if the total of (b) was 3 or 4, and to a special clean-up Census Representative if the total
of (b) was 5 or more.
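
Because the rules above are quite involved, a short sketch may help to show how they fit together. It is an illustration only: the class and field names are hypothetical, and the counts of check marks are assumed to have been tallied from each household's Form 71 beforehand.

```python
from dataclasses import dataclass

ACCEPTANCE_NUMBER = 2  # from the text: a total above 2 leads to rejection


@dataclass
class HouseholdChecks:
    """Check-mark counts from one Form 71 (field names are illustrative)."""
    vital_100pct: int  # marks in shaded "100 per cent questions"
    vital_sample: int  # marks in shaded "sample questions"
    non_vital: int     # marks in non-shaded columns

    def type1(self) -> bool:
        return self.vital_100pct >= 1

    def type2(self) -> bool:
        return self.vital_sample >= 1 or self.non_vital >= 6


def review_ea(households):
    """Apply the acceptance rules to the sampled households of one EA."""
    total_a = sum(1 for h in households if h.type1())
    total_b = sum(1 for h in households if h.type1() or h.type2())
    reject_2b = total_a > ACCEPTANCE_NUMBER or total_b > ACCEPTANCE_NUMBER
    reject_2a = total_a > ACCEPTANCE_NUMBER
    if not reject_2b:
        clean_up = None
    elif total_b <= 4:
        clean_up = "original Census Representative"
    else:
        clean_up = "special clean-up Census Representative"
    return {"total_a": total_a, "total_b": total_b,
            "reject_2a": reject_2a, "reject_2b": reject_2b,
            "clean_up": clean_up}


# Hypothetical sample of 12 households: three show Type 2 rejects only.
sample = ([HouseholdChecks(0, 0, 0)] * 9
          + [HouseholdChecks(0, 1, 0)] * 2
          + [HouseholdChecks(0, 0, 7)])
result = review_ea(sample)
```

In this example total (a) is 0 and total (b) is 3, so the Forms 2B are rejected and returned to the original Census Representative while the Forms 2A are accepted.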

Thus the details are quite involved. Note the assumption that decisions on the
Forms 2A can be based on the 100% questions as noted on the sample of
long Forms 2B, which may or may not be more stringent than is actually required.


4.2 Contingency Plans


Two contingency plans were developed for implementation by ROR's. Their
purposes were as follows:

(1)  To  limit  the  number  of  rejections  to  clean-up  Census  Representatives 
in  areas  which  have  high  rejection  rates  (by  loosening  the  q.c.  plan). 

(2)  To  speed  up  the  quality  control  operation  (i.e.  work  of  ROR  Technicians) 
if it runs far behind schedule (by reducing the sample size).

5. APPLICATION OF QUALITY CONTROL TO THE OFFICE CODING OF WRITTEN RESPONSES TO
QUESTIONS CARRIED OUT IN REGIONAL PROCESSING OFFICES

5.1 Details of Procedures

The  system  consists  of  the  following  steps: 

a) Noting, i.e. coding responses on a questionnaire facsimile, for a sample of
six Forms 2B (long questionnaires) for an EA.

b) Coding of EA.

c) Review of coding: that is comparing the noting and coding.

d) Adjudication for those EA's rejected at review (too many disagreements).

e) Re-coding of those EA's rejected at Adjudication (too many coder errors).

f) Feed back and action on coder's and noter's errors.

The following is an outline of the quality control system which was used for
the coding of economic characteristics in the 1971 Census. The system for the
general coding is the same except that there are no referrals for general coding.

(i) The system may be described simply as 2-way independent sample
verification with dependent adjudication of disagreements. Adjudication
was necessary because it appears that noters make almost as many errors
as the coders. Hence to avoid unnecessary rejection (due to noter
errors) a third person decided who was right or wrong: noter, coder or
both.


(ii)  The  movement  of  an  EA  through  the  system  is  diagrammed  in  the 
attached  flow  chart  (see  appendix  2).  Noting,  coding  and  review 
were  basically  the  same  as  in  the  1969  Trial  Census. 

(iii)  On  the  basis  of  the  number  of  items  (questions)  coded  and  the 
number  of  disagreements  (including  referral  disagreements)  the 
reviewer  used  Table  1  (see  appendix  3)  to  determine  whether  the 
EA  should  be  accepted  or  rejected  for  adjudication.  If  the  EA 
was accepted at this stage it passed on to Operation 8B for
referral  coding  or  to  Operation  9  if  there  were  no  referrals. 

(iv)  If  the  EA  was  rejected  by  the  reviewer  it  went  to  a  referral  coder 
in  Operation  8B  who  adjudicated  the  disagreements  in  the  quality 
control  sample.  On  the  basis  of  the  number  of  items  (questions) 
coded and the number of coder errors (excluding referral errors)
the  referral  coder  used  Table  2  (see  appendix  4)  to  determine 
whether  the  EA  should  be  accepted  or  rejected  for  re-coding.  If 
the  EA  was  accepted  at  this  stage  the  referral  coder  coded  any 
referrals  and  passed  the  EA  on  to  Operation  9. 

(v) If the EA was rejected for re-coding, the referral coder did the
re-coding  unless  the  work  load  was  too  great  in  which  case  re-coders 
were  chosen  from  Operation  8A  or  Operation  7. 

(vi) During the adjudication of the Q.C. sample the referral coder
completed 2 error listing forms - one for the coder and one for
the noter. These error listings were for re-training both coders
and noters whose work is rejected. Note that it was possible to
reject the noter's work by counting his errors and using Table 2 as
for the coder.
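
The path of one EA through review and adjudication, as described in (iii) to (v), can be summarised in a minimal sketch. The function name and return strings are illustrative; the two acceptance numbers are assumed to have been looked up beforehand in Table 1 and Table 2 for the number of items coded in the quality control sample.

```python
def process_ea(disagreements, coder_errors, review_accept_max, adjudication_accept_max):
    """Follow one EA through review and, if needed, adjudication.

    review_accept_max and adjudication_accept_max are the acceptance
    numbers read from Table 1 and Table 2 respectively.
    """
    # Review: compare the noter's and coder's work on the sample.
    if disagreements <= review_accept_max:
        return "accepted at review"
    # Adjudication: only errors attributed to the coder count here.
    if coder_errors <= adjudication_accept_max:
        return "accepted at adjudication"
    return "rejected for re-coding"


# With 40 items coded, the 15% Table 1 accepts up to 9 disagreements and
# the 10% Table 2 accepts up to 6 coder errors (see the appendix tables).
outcome = process_ea(disagreements=12, coder_errors=5,
                     review_accept_max=9, adjudication_accept_max=6)
```

Here the EA fails review with 12 disagreements but is accepted at adjudication, since only 5 of them are judged to be coder errors.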




5.2 Expected reject rates at Review, Adjudication and Overall Re-coding Rate

It was anticipated on the basis of data from the 1969 Trial Census, which had a
different and simpler coding and quality control system, that suitable starting
Tables 1 and 2 would be as given below. Note that Quality Control Tables 1
and 2 were available for percentages from 2% to 30% in 1% steps.

Use  of  the  5%  Table  2  by  itself  would  imply  that  the  average  outgoing  quality 
limit (AOQL) would be 5%, that is, the proportion of questions office coded in
error would be expected to be less than or equal to 5%. The situation is a
little  complicated,  in  fact,  by  the  use  of  a  screening  process  at  Review 
followed  by  Adjudication  and  we  will  have  to  determine  what  the  AOQL  is  for 
the  combinations  of  Table  1  and  Table  2  that  were  used  during  Regional  Office 
Processing. 
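
To make the AOQL idea concrete, the following sketch computes it for a single sampling plan. It rests on several assumptions that the text does not confirm: errors in the sample follow a binomial distribution, a rejected EA is re-coded without error, and the sample is small relative to the EA, so that the average outgoing error rate is the incoming rate times the probability of acceptance. It does not model the combined Review and Adjudication screening, and the function names are hypothetical.

```python
from math import comb


def prob_accept(p, n, c):
    """Binomial probability of at most c errors among n sampled items."""
    return sum(comb(n, d) * p**d * (1 - p)**(n - d) for d in range(c + 1))


def aoql(n, c, steps=1000):
    """Scan incoming error rates and return (worst rate, AOQL).

    Average outgoing quality is approximated as p * Pa(p), assuming
    rejected work is fully corrected.
    """
    worst_p, limit = 0.0, 0.0
    for i in range(steps + 1):
        p = i / steps
        aoq = p * prob_accept(p, n, c)
        if aoq > limit:
            worst_p, limit = p, aoq
    return worst_p, limit


# Example plan: 40 items coded with acceptance number 6, the entry for
# 36-42 items in the 10% Table 2.
worst_p, limit = aoql(40, 6)
```

The same scan could be repeated for other sample sizes and acceptance numbers to see how closely a given table holds the outgoing error rate to its nominal percentage.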


STARTING TABLES

                                      Review                  Adjudication or Recoding

1. General Coding                     8% Table 1 (Opn. 5)     5% Table 2 (Opn. 5)

2. Economic Characteristics Coding    15% Table 1 (Opn. 7)    10% Table 2 (Opn. 8B)

With the above starting tables the rejection rates at review and at
adjudication are given below. The expected re-coding rate is also given;
it is obtained by multiplying the rejection rate at review by the
rejection rate at adjudication and dividing by a hundred.


                               Rejection Rates        Rejection Rates        No. of EAs Recoded
                               at Review              at Adjudication        as % of all EAs

1. General Coding              20% to 25% (Opn. 5)    40% to 50% (Opn. 5)    10%

2. Economic Characteristics    25% (Opn. 7)           40% (Opn. 8B)          10%

Note that these rejection rates were only thought to be rough approximations
in view of the different situation and the limited data available from the
1969 Trial Census. The number of EAs recoded in General Coding (in Operation
5) was budgeted to be 10 per cent of all EAs going through Operation 6.

Similarly,  the  number  of  EAs  recoded  in  Economic  Characteristics  coding 
(Operation  8B)  was  budgeted  to  be  10  per  cent  of  all  EAs  going  through 
Operation  8A.  As  the  Operations  5,  7  and  8B  progressed  the  Tables  1  and 
2  were  to  be  changed  to  keep  recoding  of  either  general  coding  or  economic 
characteristics  coding  between  5%  and  10%  of  all  EAs. 
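
The arithmetic behind the expected re-coding rates in this section is simple enough to check directly. The sketch below applies the stated rule (rejection rate at review times rejection rate at adjudication, divided by a hundred) to the anticipated rates; the function name is illustrative.

```python
def expected_recoding_rate(review_reject_pct, adjudication_reject_pct):
    """Expected percentage of all EAs recoded, from the two rejection rates."""
    return review_reject_pct * adjudication_reject_pct / 100


# Economic characteristics coding: 25% at review, 40% at adjudication.
economic = expected_recoding_rate(25, 40)

# General coding: the anticipated ranges give a lower and an upper figure.
general_low = expected_recoding_rate(20, 40)
general_high = expected_recoding_rate(25, 50)
```

For economic characteristics coding this gives 10 per cent, and for general coding the range runs from 8 to 12.5 per cent, which brackets the 10 per cent shown in the table.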

5.3  Control  and  Contingency  Plans 

On a weekly basis Regional Processing Offices provided Head Office with the
following  information  required  to  monitor  the  operations,  evaluate  rates  and 
implement  changes  in  Tables  1  and  2. 

1) No. of EA's accepted at Review Stage (Table 1)

2) No. of EA's rejected at Review Stage (Table 1)

3) No. of EA's accepted at Re-coding Stage (Table 2)

4) No. of EA's rejected at Re-coding Stage (Table 2)

5) Disagreement rate for all coders as obtained from Forms R108. (The no. of agreements
and no. of disagreements was supplied.)

6) No. of EA's which had completed general coding and economic characteristics coding.

There was an initial staffing problem as offices were not staffed up to
strength  nor  had  any  allowance  been  made  for  attenuation  of  staff.  However, 
this  problem  was  sorted  out  and  there  was  no  need  to  make  recourse  to  any  of 
the  contingency  plans  involving  such  considerations  as  closing  down  one  or 
other  of  the  quality  control  operations  and  moving  the  staff  to  the  associated 
coding  operation  (i.e.  to  general  coding  or  economic  characteristics  coding). 


One change was made to keep the number of EA's recoded within the
desired 5% to 10%: the plans for economic characteristics coding were
tightened as the rejection rate fell below the given range. On August 1
Tables 1 and 2 were changed from 15% and 10% to 12% and 8% respectively.

6. APPLICATIONS OF QUALITY CONTROL TO HEAD OFFICE PROCEDURES

Apart  from  Diary  I  and  Diary  II  relatively  little  work  was  put  into  developing 
control  procedures  for  microfilming  and  FOSDIC  operations  and  the  U.S.  Bureau  of 
the  Census  procedures  were  adopted  with  few  modifications.  In  one  particular  area, 
namely,  clerical  review  of  Diary  I  printouts  (and  where  necessary  of  the 
corresponding  questionnaires  in  the  EA,  etc.)  a  quality  check  was  put  into 
effect.  A  formal  quality  control  procedure  was  not  developed  due  to  the  late 
request  for  assistance  and  the  shortage  of  staff  who  could  design  quality  control 
plans  at  that  time. 




Appendix  2 


FLOW CHART OF ECONOMIC CHARACTERISTICS
CODING AND QUALITY CONTROL OPERATION


Appendix 3


TABLE 1 - TABLEAU 1
(15%)

Number of items coded     Accept if number of items       Reject if number of items
                          disagreed is less than or       disagreed is equal to or
                          equal to                        greater than

Nombre de postes codés    Acceptez si le nombre de        Rejetez si le nombre de
                          postes qui ne concordent pas    postes qui ne concordent pas
                          est moins que ou égal à         est égal à ou plus grand que

  1-  3                    0                               1
  4-  7                    1                               2
  8- 11                    2                               3
 12- 15                    3                               4
 16- 19                    4                               5
 20- 23                    5                               6
 24- 28                    6                               7
 29- 33                    7                               8
 34- 37                    8                               9
 38- 42                    9                              10
 43- 46                   10                              11
 47- 51                   11                              12
 52- 56                   12                              13
 57- 61                   13                              14
 62- 66                   14                              15
 67- 71                   15                              16
 72- 77                   16                              17
 78- 82                   17                              18
 83- 87                   18                              19
 88- 92                   19                              20
 93- 97                   20                              21
 98-102                   21                              22
103-107                   22                              23
108-113                   23                              24
114-118                   24                              25
119-124                   25                              26
125-129                   26                              27
130-134                   27                              28
135-140                   28                              29
141-145                   29                              30
146-150                   30                              31
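
As an illustration of how a reviewer's lookup in the 15% Table 1 could be expressed, the sketch below stores the upper bound of each band of "number of items coded" and finds the acceptance number by binary search. The band limits are taken from the table above; the function names and the use of Python's `bisect` module are assumptions made for the example.

```python
from bisect import bisect_left

# Upper bound of each "number of items coded" band in the 15% Table 1.
# The position of a band in this list equals its acceptance number.
UPPER_BOUNDS = [3, 7, 11, 15, 19, 23, 28, 33, 37, 42, 46, 51, 56, 61, 66, 71,
                77, 82, 87, 92, 97, 102, 107, 113, 118, 124, 129, 134, 140,
                145, 150]


def acceptance_number(items_coded):
    """Largest number of disagreements for which the EA is accepted."""
    if not 1 <= items_coded <= UPPER_BOUNDS[-1]:
        raise ValueError("Table 1 covers 1 to 150 items coded")
    return bisect_left(UPPER_BOUNDS, items_coded)


def review_decision(items_coded, disagreements):
    """Return 'accept' or 'reject' according to the 15% Table 1."""
    if disagreements <= acceptance_number(items_coded):
        return "accept"
    return "reject"


decision = review_decision(40, 10)
```

With 40 items coded the acceptance number is 9, so 10 disagreements lead to rejection for adjudication.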

Appendix 4


TABLE 2 - TABLEAU 2
(10%)

Number of items coded     Accept if number of items       Reject if number of items
                          disagreed is less than or       disagreed is equal to or
                          equal to                        greater than

Nombre de postes codés    Acceptez si le nombre de        Rejetez si le nombre de
                          postes qui ne concordent pas    postes qui ne concordent pas
                          est moins que ou égal à         est égal à ou plus grand que

  1-  5                    0                               1
  6- 11                    1                               2
 12- 16                    2                               3
 17- 22                    3                               4
 23- 28                    4                               5
 29- 35                    5                               6
 36- 42                    6                               7
 43- 48                    7                               8
 49- 55                    8                               9
 56- 62                    9                              10
 63- 69                   10                              11
 70- 77                   11                              12
 78- 84                   12                              13
 85- 91                   13                              14
 92- 98                   14                              15
 99-106                   15                              16
107-114                   16                              17
115-122                   17                              18
123-129                   18                              19
130-136                   19                              20
137-144                   20                              21
145-150                   21                              22


Appendix 1


1971 CENSUS OF CANADA - RECENSEMENT DU CANADA DE 1971
DOMINION BUREAU OF STATISTICS - BUREAU FÉDÉRAL DE LA STATISTIQUE

Form 71

FIELD QUALITY CONTROL RECORD
DOSSIER DU CONTRÔLE QUALITATIF SUR PLACE

[Facsimile of Form 71. For each person the form provides check-mark columns
under "100 per cent questions" and "Sample questions", each divided into
Housing and Population, with housing columns H1 to H9 and columns numbered
11 to 40. A household summary records check marks against 100 per cent vital
questions, sample vital questions, and 6 or more non-vital questions. The
household result (Accepted, Type 1 reject or Type 2 reject) is entered for a
first review and a second review, each with its date and the Quality Control
Technician.]




IMPROVED  CENSUS  METHODOLOGY  AND  ITS  IMPACT  ON  QUALITY  OF  DATA 


Shashi  Sharma 



The  1971  Census  of  Canada  will  benefit  from  a  more  comprehensive 
advance  testing  and  evaluation  program  than  any  previous  census. 

The  purpose  of  the  testing  program  has  been  to  assess  the  feasibility 
of  a  variety  of  proposals  designed  to  improve  the  quality  of  the 
census operation. Quality in this context should not be interpreted
only in terms of data reliability, but in the broader
sense  of  the  ability  of  the  statistical  agency  to  satisfy  user 
demand  for  census  information.  This  latter  interpretation  of 
quality  includes,  inter  alia,  data  reliability,  data  timeliness, 
data  availability,  and  flexibility  of  data  retrieval,  as  well  as 
the  inevitable  consideration  of  economy,  or  the  cost  at  which 
data  can  be  provided.  An  outline  of  some  of  the  more  important 
innovations  introduced  for  the  1971  Census  is  presented  below. 

CENSUS  TESTING  PROGRAM 

The  first  important  factor  is  the  testing  program  itself  which 
has  been  more  rigorous  and  comprehensive  than  that  associated 
with  any  previous  census.  The  program  has  been  designed  with 
both  content  and  methodology  in  mind.  To  date,  on  the  methodological 
side,  the  program  has  included  an  evaluation  of  self-enumeration 
(with  appropriate  follow-up),  the  use  of  postal  delivery  of 
questionnaires  and  their  return  through  the  mail,  the  operational 
feasibility  of  sampling  at  the  field  collection  stage,  centralized 
and decentralized office procedures, automated document reading,
and various innovations in data access and retrieval. Content
has  been  included  in  all  tests  in  the  sense  that  "real"  rather 
than  mock  data  have  been  involved.  More  specifically,  content 
has  been  thoroughly  evaluated  in  a  test  of  alternative  wording 
and  format  of  new  and  traditional  census  topics.  As  a  result 
of  the  latter  test,  a  questionnaire  has  been  developed  which 
should  reflect,  more  closely  than  ever  before,  the  conceptual 
basis  of  census  topics. 

SELF-ENUMERATION 

It  has  been  noted  already  that  enumerators  can  be  a  major  source 
of  response  error.  Accordingly,  a  determined  effort  has  been 
made  to  develop  the  requisite  field  methodology  which  would  reduce 
the  role  of  enumerators.  This  methodology,  self-enumeration  with 
follow-up, involves respondents in completing their own questionnaires.
An  appropriate  follow-up  telephone  call  or  personal  visit  from  an 
enumerator  is  made  in  the  case  where  a  respondent  did  not  complete 
the  questionnaire  or  did  not  complete  it  fully  and  consistently. 

The  experience  of  the  1960  Census  of  the  United  States,  the 
subsequent  American  testing  program,  and  the  London  Census  Test  of 
1967  in  Canada,  demonstrated  that,  in  general,  respondents  are 
capable  of  completing  a  rather  long  and  complicated  questionnaire 
without  the  assistance  of  an  enumerator.  Approximately  85  per 
cent  of  the  questionnaires  in  London  containing  the  full  range 
of census questions were completed and returned by the respondent.
Of these, some 50 per cent were satisfactory for subsequent
processing. The remainder had to be followed up, mostly by
telephone,  to  resolve  minor  inconsistencies  or  to  collect  missing 
information.  The  success  of  the  London  experiment  has  encouraged 
the Census Division of Statistics Canada to propose that
self-enumeration be used in the 1971 Census in all areas except parts
of  the  Yukon,  Northwest  Territories,  and  certain  other  remote 
places  where  enumerator  follow-up  is  impractical  because  of  the 
large  distances  between  dwelling  units. 

The  actual  methodology  of  self-enumeration  with  follow-up  that 
is  proposed  for  the  1971  Census  is  too  lengthy  to  describe  in  this 
presentation.  What  is  relevant  here  are  the  reasons  why  the 
proposed  methodology  is  expected  to  represent  a  substantial 
improvement: (i) as mentioned above, the enumerators' contribution
to the response errors will be substantially diminished since the
direct  role  of  enumerators  is  greatly  reduced;  (ii)  each  adult 
member  of  the  household  is  able  to  answer  the  census  questions 
himself.  In  previous  censuses,  very  often  the  housewife  had  to 
provide  all  the  complex  and  detailed  information  to  the  enumerator 
concerning  all  members  of  her  household;  (iii)  respondents  are 
able  to  consult  their  records;  (iv)  the  enumerator  can  review  the 
census  returns  and  collect  the  missing  information. 




CONTENT 

The content of the 1971 Census was improved in two ways.

First,  there  were  more  topics  of  enquiry  than  in  previous  censuses. 
The  1971  Census  contained  ninety  questions,  including  parts  of 
questions,  compared  with  seventy  in  1961.  This  expansion  of 
content  had  been  established  only  after  an  extended  program  of 
user  liaison  designed  to  assess  the  relative  merits  of  the  variety 
of  potential  topics  suggested  for  inclusion  in  the  census 
questionnaire.  Second,  the  reliability  of  the  information  has 
been  enhanced  as  a  result  of  the  special  test  of  content  mentioned 
earlier,  as  well  as  by  the  proposed  adoption  of  self-enumeration 
techniques.

TRAINING 

The  census  is  by  far  the  largest  statistical  survey  carried  out 
in  Canada.  It  is  also  an  infrequent  survey,  and  one  in  which 
content and methodology are never static. Under these
circumstances, there is a paramount need for an effective training
program  at  all  stages  of  the  operation.  Since  1961  the  Census 
Division  has  established  a  Training  Section  composed  of  qualified 
personnel  whose  task  is  to  ensure  that  concepts  and  procedures 
are  thoroughly  understood  by  temporary  census  staff.  In  1971 
training  is  being  effected  at  all  stages  of  the  operation  by  means 
of  specially  produced  training  brochures,  rather  than  by  means  of 
the basic procedural manuals as in 1961. In addition, the time
devoted to training classes and instruction periods has been
substantially  increased,  which  should  have  a  notable  impact  on 
the  quality  of  census  data. 

COMPUTER  FACILITIES 

Census  material  in  1971  will  be  processed  through  third-generation 
electronic  data  processing  equipment.  This  should  permit  a 
noticeable improvement in the speed at which data can be made available
compared with release times in the 1961 and 1966 Censuses,
which utilized first-generation equipment.

An improvement will also be registered at the computer input stage.
Automatic  document  reading  was  introduced  in  1951,  but  the  extent 
of  pre-office  coding  was  much  more  limited  than  can  be  expected 
in  1971.  Wherever  possible,  the  questionnaire  is  designed  to 
facilitate  the  self-coding  of  responses  by  respondents  for  direct 
input  into  the  document  reader.  The  reduction  of  the  extent  of 
office  coding  for  many  of  the  questions  reduces  the  propensity  for 
error,  and  thereby  contributes  directly  to  the  quality  of  the 
census  operation. 

Improvements  expected  are  not  confined  to  those  resulting  from 
hardware  developments.  Greater  use  of  the  computer  will  be  made 
at  the  editing  and  imputation  stage  in  order  to  get  a  "cleaner" 
final  document.  In  addition,  significant  improvements  can  be 
expected at the access and retrieval stages. Simplified programming
techniques are being developed to facilitate the access by
professional subject-matter statisticians to tape and disc-stored
data. And, finally, a computerized geocoding system is
being  developed  which  will  permit  users  to  specify  precise  areal 
boundaries  for  special  requests. 

All  of  the  above  changes  in  census  methodology  will  have  some 
favourable impact on the quality of census material. In most
cases,  the  effect  will  be  to  enhance  quality  in  the  usually 
accepted  sense,  i.e.,  the  reliability  of  the  end  product.  In 
other  cases,  quality  will  be  improved  as  a  result  of  increased 
timeliness  as,  for  example,  by  the  use  of  third-generation  computer 
hardware.  In  still  other  cases,  quality  will  be  improved  as  a 
result  of  greater  flexibility  as,  for  example,  by  the  introduction 
of  computerized  geocoding.  Thus,  the  introduction  of  sampling 
must be viewed in the total context of all the changes which have
been  introduced  in  the  census  operation  in  1971  and  which  have  been 
designed  to  improve  the  reliability  and  timeliness  of  the  data. 

We  may  now  turn  to  the  question  of  sampling,  to  examine  its  impact 
on  the  reliability,  timeliness,  and  cost  of  census  data. 




APPENDICES 


Pre-Test

Date: November - December, 1966
Area: Ottawa, Ontario

Terms of Reference:

Using the first draft of the questionnaire being prepared for the
1967 Census Test in London, the test was to facilitate the evaluation
of the validity of the questionnaire as a data-gathering instrument:

i) respondents' understanding of questions as they are
intended

ii) their ability to complete the questionnaire

iii) adequacy and meaningfulness of categories within the
questionnaire.

249 questionnaires were too incomplete to be processed.
[illegible] questionnaires were processed.

Cost:

Exact figures on cost are not available as it was included
as part of the 66-67 programme.

[illegible] enumerators (Test involved [illegible] Census Tracts within Ottawa).


London Census Test

1967

Date: September 12, 1967
Area: London, Ontario

Terms of Reference:

To test and evaluate:

1) the feasibility of using self-enumeration in Canada

2) the ability to develop a system for preparing and utilizing
an up-to-date register of household addresses

3) Mail-out/Mail-back procedures developed through co-operation
with the Post Office

4) the development of processing procedures including
hiring and training a local staff for a central office

5) field follow-up procedures for households failing to
return questionnaires

6) telephone and/or field follow-up procedures for those
households whose questionnaires did not meet
acceptability standards.

Results:

42.6% required no follow-up
[remaining figures illegible]

Central office operation with locally hired and trained staff is
feasible provided trained Head Office personnel can provide technical
supervision.

In London the ratio of long to short questionnaires was 1 to [illegible];
return rate was about the same for both.

Cost: $328,000

Unusually high cost was due to the fact that the processing office
was located in London.

10 special place enumerators - 1.75/hr. + mileage [illegible]
110 enumerators - [illegible] per questionnaire + $10.00 per day training
[illegible] clerks
30 supervisors (includes 12 enumeration supervisors -
$60 office allowance + [illegible] mileage allowance)
129 letter carriers

Head Office Personnel

2 - administrative
2 - technical
1 - regional office person (field supervisor).


The Toronto Area Test

1968

Date: May 21, 1968
Area: Toronto, Ontario (Two samples of about 3200 households each)

Terms of Reference:

1. A test of two versions of the long questionnaire differing in
format and content, particularly in [illegible] characteristics
2. Gather response data
3. Test central office processing
4. Gather additional data on cost estimates and pay structures
5. Realistic projections of time expenditures and number
and type of personnel required
6. To evaluate methods and procedures.

Results:

The response rate was just over [illegible].
It had been impossible to develop effective publicity where only a
1% sample was involved.
Respondents resented being chosen.
10% of the questionnaires mailed out did not reach their destination.
Central office processing is feasible provided locally hired and trained
staff are closely directed by DBS supervisors.

Cost: $35,707 (up to [illegible] Processing)

20 enumerators - $1.90/hr. + .10 mile
30 supervisors, stenographers - [illegible]/hr.
clerks - 1.60/hr.


Rural Area (October) Test

1968

Date: October 22, 1968
Area: Annapolis, Nova Scotia
Napierville, Quebec
Durham, Ontario
Lethbridge, Alberta

Terms of Reference:

To test and evaluate:

1) French and English, Population and Housing, and Agriculture
questionnaire design and content

2) Response rates in rural areas

3) To evaluate combined census of Population and Housing
and Agriculture

4) Realistic projections of time expenditure and the number
and type of personnel required. Specifically, rates of
pay and cost estimates in rural areas

5) Procedures designed to minimise sampling selection [illegible]

Note: The procedure employed:

i) long (sample) questionnaire - drop-off/pick-up;
ii) short questionnaire - interview method;

agriculture questionnaire - interview method except where a
pre-mailed questionnaire to operators of large holdings had
been completed.

Results:

[illegible] completed or partially completed by the respondent
[illegible] fully acceptable
77% completed in whole or in part by enumerators

74% completed by respondent
26% completed in whole or in part by enumerators.

The test lacked effective publicity.
Respondents picked on the word "test".

Cost: $57,318 (Field only)

                         Annapolis   Napierville   Durham   Lethbridge
[row label illegible]        20      [illegible]     21         13

Commissioners: $1000 + $200 office allowance + [illegible]
Enumerators: piece rate + $7 per half day of training
Special Enumerators: [illegible]/hour


Census [remainder of title illegible]

Date: September 30, 1969
Area: Sherbrooke, Quebec
St. Catharines, Ontario
Souris, Manitoba

Terms of Reference:

In preparation for the 1971 Census,

1) To rehearse field, office and computer procedures

2) test self-enumeration questionnaire content and design

3) gather data aimed at finalising rates of pay for field
and office staff

4) allow accurate forecasts of field time and staff
requirements

5) provide data for cost comparison between alternative
methods of enumeration.

Questionnaires returned by mail:

St. Catharines - 51%
Sherbrooke - 81%
(long and short returned in equal proportions)

Response Rates:

            Total    Respondent    Enumerator    Fail or Blank
Form 2B     93.6     79.[?]        13.0          [illegible]
Form 2A     93.3     [?]6.3        12.0          1.2

(based on question by question study)

[illegible] completed by respondent
[illegible] completed in part by enumerator
12% completed by enumerator.

Cost: [illegible]

                 Sherbrooke   St. Catharines   Souris        Total
[illegible]          01            01          [illegible]    03
P.O. Clerks          03        [illegible]      01            07
Commissioners        10            12           01          [illegible]
Enumerators         219        [illegible]      22          [illegible]

[illegible] letter carriers
Commissioners: [illegible] + allowance
Enumerators: [illegible] per half day of training;
[illegible] rate for drop-off and 2.00/hour for mail-back and pick-up.
Clerks and Clean-up enumerators - [illegible]/hour + mileage.


[illegible] FOR SELF-ENUMERATION IN THE 1971 CENSUS

Each Person has an Opportunity to report for himself.

It reduces Bias and Variance in Enumeration from Enumerators.

It reduces the amount of time required to take a Census.

It provides time to record considered and more [illegible]
necessary by consulting relevant documents.

It may yield more answers to questions of a personal or [illegible]

It gives added emphasis to confidentiality in mail-back areas.

It creates greater public interest and sense of responsibility for
the success of the Census.


HOW HAVE THE LIMITATIONS OF SELF-ENUMERATION BEEN OVERCOME

1. Non-response is dealt with by a follow-up procedure, [illegible]

The questions are sufficiently simple and [illegible]
with the help of the Instruction Booklet [illegible] and the provision of [illegible]
assistance service will help to clear up many misunderstandings. [illegible]
questionnaire and field procedures have been tested.

The answers to the Census questionnaire are checked in the [illegible]

We are applying quality control techniques to [illegible]
whether questionnaires in an EA have been completed well [illegible]
the EA should be accepted as it is or rejected for further follow-up.
Detailed checks are made with the aid of the computer during [illegible]

The detailed checks could [illegible] reasonably be carried out in the [illegible]
could they be carried out consistently in [illegible]
make adjustments for inconsistencies.


SAMPLING  AND  ITS  IMPACT  ON  RELIABILITY  OF  CENSUS  DATA 


Gordon  Brackstone 




In the 1971 Census, sampling was used for private dwellings in self-enumeration
areas (which contain 97% of the population). A short questionnaire (Form 2A)
contained basic demographic and housing questions, while the long questionnaire
(Form 2B) contained the same basic questions plus extra questions on topics
that included housing conditions, demographic items, education, migration, and
economic characteristics. In collective dwellings (institutions, hotels, etc.)
and in canvasser areas (generally remote areas or areas presenting special
enumeration problems) the whole population was enumerated on Forms 2B.

The  main  factors  that  were  considered  prior  to  the  decision  to  introduce 
sampling into the 1971 Census were:

(i)  Effects  on  cost 

(ii)  Effects  on  timeliness  of  publication 

(iii)  Response  burden  on  the  public 

(iv)  Effects  on  reliability  of  Census  data 

The  effect  of  sampling  on  costs  is  to  introduce  considerable  savings  in  many 
areas  of  Census  collection  and  processing.  These  areas  include  follow-up  of 
incomplete  questionnaires,  coding  of  long  form  questions,  and  document 
reading. The savings achieved by taking a 1/3 sample as compared to 100%
long-form enumeration have been estimated at more than $6 million, or 16%
of the Census budget for 1971. There are some extra costs involved with
sampling  (e.g.  slightly  more  complex  administrative  and  control  procedures; 
and  the  need  to  develop  a  weighting  system  to  scale  up  sample  figures  to  the 
population  level),  but,  in  a  Census  as  large  as  that  of  1971,  these  are 
negligible  compared  with  the  savings. 


The introduction of sampling, by reducing the number of long-forms that need to
be processed, also reduces the time-lag between Census day and the final
publication  of  cross  classification  tabulations  involving  long-form  data.  The 
greatest  savings  of  time  are  realized  in  coding  operations  and  computer  processing. 

Asking every household to answer a long questionnaire would not only put a
burden on the Census collection and processing operations but would also put
an unnecessary burden on the public. By sampling at the rate of 1 in 3, the
average  number  of  questions  that  had  to  be  answered  per  person  or  per  household 
has  been  greatly  reduced. 

The  above  three  considerations  all  support  the  use  of  sampling.  These  have 
to  be  weighed  against  considerations  of  reliability.  The  measure  of  reliability 
considered  is  the  RMSE  which  contains  components  due  to  sampling  variance, 
response  variance,  response  bias  and  time-lag  bias. 

Both the sampling variance and the response variance decrease as the sample
size increases. Sampling variance becomes zero with a 100% sample, but response
variance is still present. Thus, even without sampling, there will be a
positive RMSE attached to each Census estimate. The Coefficient of Variation
(CV) will be larger for estimates based on fewer sample cases (i.e. for small
areas) than for estimates based on a large number of sample cases. Furthermore,
for a given area and a given number of persons or households in the sample,
the  CV  will  be  larger  for  rare  cells  (i.e.  cells  containing  few  persons)  than 
for  common  cells.  Thus  the  CV  of  an  estimate  depends  primarily  on  the  size 
of  the  estimate,  and,  to  a  lesser  extent,  on  the  size  of  the  area  to  which  it 
refers.  Table  A  provides  an  indication  of  the  expected  magnitude  of  the 
RMSE for 1971 Census data. Columns 9 and 10 are the columns that are relevant
to the 1971 Census. From this table one can see that the effect of sampling
(compare  column  9  with  column  5)  is  small  for  large  estimates  (500  or  more), 
but  can  be  quite  large  for  small  estimates  (50  or  less).  However,  for  small 
estimates, the data is not very reliable even without sampling. This table
is based on an analytical formula for RMSE which was itself based on
empirical experience. One of the assumptions that was derived from some
empirical studies, and which was incorporated into the analytical formula
for RMSE, implied that the RMSE for a 33 1/3% self-enumeration sample coincided
with the RMSE for a 100% canvasser Census. Hence the exact agreement between
columns  3  and  9. 

This  table  also  makes  an  assumption,  based  on  experience  in  Canada  and  the  U.S., 
that  the  response  bias  is  equal  to  6%  of  the  estimate.  While  this  is  probably 
close to the truth for characteristics such as income, occupation, industry,
and  employment  status,  it  may  be  an  overstatement  for  other  more  straightforward 
characteristics.  Table  A  takes  no  account  of  time-lag  bias. 
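
The way these components combine can be illustrated with a rough sketch. It is not the analytical formula behind Table A, which is not reproduced here. It assumes simple random sampling of persons with a finite population correction, leaves out response variance and time-lag bias, applies the 6% response bias mentioned above, and takes the CV to be the RMSE divided by the estimate. The function names and the population of 10,000 are hypothetical.

```python
import math


def rmse_of_count(estimate, population, sampling_fraction, bias_rate=0.06):
    """Illustrative RMSE of an estimated count (assumed model, not the report's formula)."""
    proportion = estimate / population
    # Sampling variance of a scaled-up count under simple random sampling;
    # it becomes zero when sampling_fraction is 1 (100% enumeration).
    sampling_variance = ((1 - sampling_fraction) / sampling_fraction
                         * estimate * (1 - proportion))
    # Response bias taken as a fixed share of the estimate (6% by default).
    bias = bias_rate * estimate
    return math.sqrt(sampling_variance + bias ** 2)


def cv(estimate, population, sampling_fraction, bias_rate=0.06):
    """Coefficient of variation, taken here as RMSE divided by the estimate."""
    return rmse_of_count(estimate, population, sampling_fraction, bias_rate) / estimate


if __name__ == "__main__":
    for size in (50, 500, 5000):
        full = cv(size, 10_000, 1.0)
        third = cv(size, 10_000, 1 / 3)
        print(f"estimate {size:>5}: CV at 100% = {full:.3f}, CV at 1 in 3 = {third:.3f}")
```

Under these assumptions the one-in-three sample adds little to the CV of a large estimate but a great deal to that of a small one, which is the pattern described in the text.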

Table  A  is  intended  to  provide  a  general  picture  of  the  comparative  levels  of 
reliability  for  different  cell  sizes.  Since  some  characteristics  are  more 
prone to response errors than others, some cells will have higher RMSE's, and
some lower RMSE's, than shown here. Estimates for individual characteristics
or groups of characteristics will be provided by the response variance study
and will reflect the different levels of response variance inherent in
different  types  of  characteristics. 


TABLE A: Expected magnitude of the RMSE for Census data, by size of estimate
[table entries not recoverable from the scan]


Once the need for sampling had been recognized, the decision to use a 1 in 3
sample represented a compromise between savings in cost and time and the need
to produce reliable data for small areas. As mentioned earlier, it was
estimated that the reliability of 1971 Census data using a 1 in 3
self-enumeration sample would be equivalent to that achieved by a 100%
canvasser enumeration in 1961. The actual method of selecting the sample
had to be kept simple since it had to be implemented by 42,000 individual
Census representatives each in her own EA. A systematic sample of every
third household was used with a random start in each EA (i.e. either the
1st, 2nd, or 3rd household in the EA was selected at random to receive a
long-form and thereafter every third household in the EA received the
long-form). One aspect of this sampling scheme that is being evaluated is the
possible introduction of bias into the sample by the enumerator at
drop-off. There is a danger that an enumerator might tamper with the sampling
scheme in order to reduce her workload (e.g. by giving a large household a
short-form when the sample called for a long-form).
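
The selection rule just described can be sketched in a few lines. This is only an illustration of a systematic 1-in-3 sample with a random start. The function name, the household labels and the use of a seeded generator are assumptions made for the example, not part of the Census field procedure.

```python
import random


def systematic_sample(households, interval=3, rng=None):
    """Pick every `interval`-th household after a random start within one EA."""
    if rng is None:
        rng = random.Random()
    # The random start is the position (0, 1 or 2) of the first long-form household.
    start = rng.randrange(interval)
    return [h for position, h in enumerate(households) if position % interval == start]


if __name__ == "__main__":
    ea_households = [f"household {n}" for n in range(1, 11)]  # hypothetical EA listing
    long_form = systematic_sample(ea_households, rng=random.Random(1971))
    print(long_form)
```

Because the start is the only random element, an enumerator who departs from the listing order at drop-off changes which households fall in the sample, which is the source of bias mentioned above.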

The sampling scheme used in the Census ensures that 1/3 of private dwellings
are included in the sample. However, it does not guarantee that exactly
1/3rd of each type of dwelling will be in the sample. Since, from the
short-forms, we have certain basic population and housing data on a 100%
basis, we can improve our sample estimates by using a weighting procedure
that takes account of this 100% knowledge. Thus, instead of simply
scaling up all sample data by a factor of three (to allow for the one in
three  sample)  a  more  complex  weighting  procedure  has  been  adopted  which 
ensures  that  sample  estimates  will  be  consistent  with  100%  data  for  most 
basic  population  and  housing  characteristics.  The  use  of  this  procedure 
also has the effect of reducing the sampling variance of the estimates made
from the sample. The RMSE figures calculated from the response variance
study will include the effects of the weighting procedure used in the
Census.
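
The Census weighting procedure itself is not described in this paper. As a stand-in, the sketch below uses simple post-stratification by dwelling type to show the idea of making sample estimates agree with 100% counts. The function name, the dwelling types and all the counts are hypothetical.

```python
from collections import Counter


def poststratified_weights(full_count_groups, sample_groups):
    """Weight for each group = 100% count divided by the sample count in that group."""
    full_counts = Counter(full_count_groups)
    sample_counts = Counter(sample_groups)
    return {group: full_counts[group] / sample_counts[group] for group in sample_counts}


if __name__ == "__main__":
    # Hypothetical EA: 60 houses and 30 apartments on the 100% (short-form) count,
    # of which 22 houses and 8 apartments happened to fall in the long-form sample.
    all_dwellings = ["house"] * 60 + ["apartment"] * 30
    sampled = ["house"] * 22 + ["apartment"] * 8
    weights = poststratified_weights(all_dwellings, sampled)
    for group, weight in weights.items():
        print(group, round(weight, 3))
```

With a flat factor of three the sample would imply 66 houses and 24 apartments. With the group weights the weighted sample reproduces the known totals of 60 and 30, and any long-form characteristic that varies by dwelling type is estimated with less sampling variance.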


Summary 

The  impact  of  sampling  on  the  reliability  of  Census  data  is  greatest  for  small 
cells.  However,  even  with  100%  enumeration  small  cell  estimates  are  often 
subject  to  quite  large  non-sampling  errors  and  the  effect  of  sampling  is 
to further increase the unreliability of such figures. Some general
guidelines to the expected reliability of 1971 Census data can be obtained from
Table  A.  Actual  estimates  of  RMSE  will  be  produced  by  the  response  variance 
study,  and  will  be  published  for  various  groups  of  characteristics  with  the 
Census  tables  to  which  they  relate. 




CONFIDENTIALITY AND ITS EFFECT ON CENSUS DATA


Dr. Mike Murphy



1. The problem:

It  has  become  necessary  for  the  Census  Division  to  apply  a  new 
technique  to  published  tables  to  guard  against  illegal  disclosure 
of  information  relating  to  individuals. 

The  Census  Division  was  forced  to  adopt  such  procedures  because  of: 

a.  the  new  Statistics  Act  with  a  new  definition  of  illegal  disclosure, 

b. the increased public concern over confidentiality of Census
information,

c.  the  existence  of  new  programmes  such  as  Geocoding  and  the  summary 
tape  programme  that  will  place  in  the  hands  of  the  public  data  of 
unprecedented  scope  and  detail; 

d.  the  sheer  bulk  of  the  tabulation  programme,  some  3.8  million  pages 
of  computer  printout  in  the  regular  tabulation  programme,  which 
renders  the  previous  methods  of  hand  checking  unfeasible. 

The disclosure clause in the new Statistics Act, section 16(1)(b),
reads: "... no person who has been sworn under section 6 shall disclose
or knowingly cause to be disclosed, by any means, any information
obtained under this Act in such a manner that it is possible from any
such disclosure to relate the particulars obtained from any individual
return to any identifiable individual person, business or organization".

One discloses information in a statistical table by first giving enough
information to identify an individual and then giving additional
information about him; for example, if one reveals in a table that the only
doctor in an area makes $55,000 per year, one has disclosed information
about that man.




A  further  problem  is  residual  disclosure  which  refers  to  disclosures  that  are 
obtained by combining two or more tables. The most obvious case is with area.
In Geocoding one could ask for data for two large areas which were designed so
that one completely overlaps the other except for a very small slice. By
subtracting one table from the other, a person could obtain information on the few
individuals  living  in  the  small  area,  despite  the  fact  that  the  two  large  tables 
contained  no  illegal  disclosures. 

Residual  disclosure  is  an  essentially  new  problem  raised  by  our  new  retrieval 
programmes  such  as  Geocoding.  It  is  also  impossible  to  guard  against  residual 
disclosure  without  some  technique  that  affects  every  published  figure.  Even 
large  numbers  cannot  be  considered  "safe"  as  long  as  two  large  numbers  can  be 
subtracted  to  yield  a  small  one. 
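
The subtraction described above can be sketched in a few lines of Python. This is only an illustration: the area tables, the categories, and every count are invented and do not come from any Census tabulation.

```python
# Hypothetical unrounded counts by occupation for two requested areas.
# Area B is area A plus one very small slice.
area_a = {"doctor": 40, "teacher": 310, "farmer": 1250}
area_b = {"doctor": 41, "teacher": 312, "farmer": 1250}


def subtract_tables(larger, smaller):
    """Cell-by-cell difference of two tables that share the same categories."""
    return {key: larger[key] - smaller[key] for key in larger}


# Neither table contains a small number, yet the difference isolates
# the handful of people living in the slice.
slice_counts = subtract_tables(area_b, area_a)
print(slice_counts)  # {'doctor': 1, 'teacher': 2, 'farmer': 0}
```

Both inputs look "safe" on their own; only the derived table is revealing, which is why a safeguard has to act on every published figure.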

The  Random  Rounding  Technique 

Random  rounding  has  an  effect  similar  to  conventional  rounding:  every  number 
in  every  table  is  a  multiple  of  five.  Furthermore,  in  any  table  derived  by  a 
linear  combination  of  published  tables,  every  number  will  be  a  multiple  of  five. 

Thus  it  will  be  impossible  to  attribute  any  information  published  to  an  identifiable 
individual  either  directly  or  by  manipulation  of  tables.  The  technique  gives  full 
protection  against  direct  and  residual  disclosure. 

It was not thought necessary to provide an example of this technique, since anyone
can at once visualize its effect: all numbers in the tables are multiples of five,
that is, their final digit is either 5 or 0.

Random rounding, however, differs from conventional rounding. In random rounding
the direction of rounding, whether a given number is rounded up or down, is determined
by chance rather than by an explicit set of rules. For example, with conventional
rounding the number 126 would always become 125; with random rounding 126 would
become either 125 or 130.


With random rounding, the computer in effect stops at each number and flips a
five-sided coin. Depending upon the result of the coin flip, the number is
rounded either up or down. The probability of rounding the number up is
determined by its last digit (literally by the equivalent number modulo 5, i.e., the
remainder when the number is divided by five).

A number is rounded upward with probability r/5, where r is the remainder when
the number is divided by 5. It is rounded downward with probability 1 - r/5. Thus
the following probabilities apply when the rounding base is 5.


Final Digit     Probability of Rounding Up

0 or 5          0
1 or 6          1/5
2 or 7          2/5
3 or 8          3/5
4 or 9          4/5
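
The probabilities above can be written out exactly. The following Python sketch is an illustration only; the function names are hypothetical and are not taken from the Census processing system. It lists the possible published values for a count together with their probabilities, and computes the average published value.

```python
from fractions import Fraction

BASE = 5  # rounding base used in the text


def round_up_probability(n, base=BASE):
    """Probability that the count n is rounded up: r/base, with r = n mod base."""
    return Fraction(n % base, base)


def rounding_distribution(n, base=BASE):
    """Map each possible published value to its probability."""
    r = n % base
    lower = n - r
    if r == 0:
        return {lower: Fraction(1)}
    p_up = round_up_probability(n, base)
    return {lower: 1 - p_up, lower + base: p_up}


def expected_value(n, base=BASE):
    """Average published value over many independent roundings."""
    return sum(value * prob for value, prob in rounding_distribution(n, base).items())


print(rounding_distribution(126))  # 125 with probability 4/5, 130 with probability 1/5
print(expected_value(126))         # 126
```

Because the chance of rounding up is proportional to the remainder, the average published value equals the original count for every count, not only for 126.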


This  random  feature  has  two  useful  effects.  First,  it  insures  against  the  possibility 
of  deriving  the  original  figures  by  comparing  cells  in  a  table  against  the  independently 
rounded  totals.  Second,  and  most  important,  it  makes  the  sum  of  the  rounded  numbers 
an  unbiased  estimate  of  the  sum  of  the  original  numbers.  This  will  not  be  the 
case  in  conventional  rounding  unless  there  is  an  even  distribution  of  last  digits. 

In Census data there tends to be a preponderance of small last digits, and if
conventional rounding were used, the sum of the rounded numbers would tend to
underestimate the total of the original numbers.

We  illustrate  with  a  simple  example.  Assume  that  one  is  summarizing  data  for  five 
EA's  and  that  a  given  cell  of  the  table  contains  a  1  in  all  five  EA  tables.  We 
would  have  the  following: 


Cell ij       Original     Conventional     Random Rounding
for EA #      Data         Rounding         (expected distribution)

1             1            0                0
2             1            0                5
3             1            0                0
4             1            0                0
5             1            0                0

Total         5            0                5

Conventional rounding underestimates the true total, whereas random rounding will
tend to give the proper total. This aggregation problem will not occur in published
tables when the totals are independently rounded: the published total is then on
average within two of the actual total and always within four. However, conventional
rounding is a problem when published data are used as 'building blocks', a common
practice with some of the Census's major users, federal agencies and provincial
governments.
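
The five-EA example can be checked by simulation. The sketch below is an illustration under stated assumptions: the seed and the number of trials are arbitrary, and `conventional_round` and `random_round` are hypothetical helper names rather than anything from the Census system.

```python
import random


def conventional_round(n, base=5):
    """Round to the nearest multiple of base (remainders 1-2 go down, 3-4 go up)."""
    return base * ((n + base // 2) // base)


def random_round(n, rng, base=5):
    """Round up with probability r/base, where r = n mod base."""
    r = n % base
    lower = n - r
    return lower + base if rng.random() < r / base else lower


cells = [1, 1, 1, 1, 1]      # one cell in each of five EAs, as in the table
rng = random.Random(1971)    # arbitrary seed so the run is repeatable

conventional_total = sum(conventional_round(c) for c in cells)

trials = 20_000
random_totals = [sum(random_round(c, rng) for c in cells) for _ in range(trials)]
mean_random_total = sum(random_totals) / trials

print(conventional_total)            # 0
print(round(mean_random_total, 2))   # close to the true total of 5
```

Any single randomly rounded total is still a multiple of five and may be 0, 5, 10 or more; it is the average over many such building-block sums that matches the true total.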

In  sum  the  technique  is  as  follows: 

1. The computer inspects the first number, say it is 126;

2. The computer flips its five-sided coin (actually its random number generator);

3. If the coin comes up 1, the number, 126, is rounded up to 130; if
the coin comes up 2, 3, 4, or 5, the number is rounded to 125;

4. The computer continues on in the same manner for all numbers in the table,
including totals and sub-totals.
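
The four steps can be followed literally in code. This Python sketch is an illustration only: the text gives the coin rule for 126 alone, so the general rule used here (round up when the coin shows r or less, where r is the remainder) is an assumption chosen to reproduce the stated probabilities, and the table contents are invented.

```python
import random


def flip_five_sided_coin(rng):
    """Stand-in for the random number generator: returns 1, 2, 3, 4 or 5."""
    return rng.randint(1, 5)


def random_round(n, rng):
    """Steps 1-3: round n to base 5, going up when the coin shows r or less."""
    r = n % 5                      # 126 gives r = 1
    coin = flip_five_sided_coin(rng)
    return n - r + 5 if coin <= r else n - r


def round_table(table, rng):
    """Step 4: treat every figure, including totals, independently."""
    return {label: random_round(value, rng) for label, value in table.items()}


# Hypothetical table: three cells and their total.
table = {"cell 1": 126, "cell 2": 43, "cell 3": 7, "total": 176}
published = round_table(table, random.Random(0))
print(published)
```

Because the total is rounded on its own flip, the published cells need not add exactly to the published total, but every published figure is a multiple of five and lies within four of its original.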

The  Impact  of  the  Technique  Upon  the  Data 

There  are  two  possible  approaches  to  this  problem.  The  first  is  to  assume  that 
the  figures  in  an  unrounded  census  table  are  correct  and  show  that  the  rounding 
procedure  introduces  insignificant  changes  in  these  figures. 

A  second,  more  realistic,  approach  is  to  ask  how  accurate  the  Census  figures  are, 
given  the  sampling  and  non-sampling  errors  that  are  inherent  in  them  and  the 
editing  and  imputation  procedures  that  are  applied  to  them.  One  then  asks  if  the 
rounded  figures  are  any  less  accurate  than  the  original  figures,  noting  that  it 
is  only  in  the  very  small  numbers  that  rounding  introduces  large  percent  error. 
This  section  will  briefly  consider  the  latter  approach. 

The reliability of small numbers was treated in an article by Beynon,
Ostry, and Platek.¹ They present estimates of the Root Mean Square Error
(RMSE) of published Census figures. It is worth quoting their discussion of
RMSE.


1. T. G. Beynon, S. Ostry, and R. Platek, "Some Methodological Aspects
of the 1971 Census in Canada", The Canadian Journal of Economics, III, No. 1,
February 1970, pp. 95-110.




"There  are  basically  two  types  of  error:  sampling  and  non-sampling  errors.  To 
facilitate  this  discussion  it  is  necessary  to  introduce  a  technical  term:  The 
Root  Mean  Square  Error  (RMSE).  It  is  a  statistical  measure  of  the  combined 
effect  of  both  types  of  error  and  it  has  the  property  that  an  estimate  plus  or 
minus  twice  its  RMSE  represents  an  interval  which  would  contain  the  true  (but 
unknown)  number  with  approximately  95  per  cent  probability.  An  estimate  plus 
or  minus  once  its  RMSE  represents  an  interval  which  would  contain  the  true 
(but  unknown)  number  with  a  probability  of  approximately  62  per  cent.  The 
"coefficient  of  variation"  (CV)  expresses  the  RMSE  as  a  percentage  of  the  estimate 
to  which  it  refers  (i.e.,  whose  "error"  it  measures).  Both  these  statistical 
measures,  RMSE  and  CV,  have  been  tabulated  for  various  cell  frequencies  and 
sample  sizes  and  will  be  discussed  later. 

It should be noted that the estimates of RMSE and related measures presented in
this report are based on data derived from a series of studies made as part of the
1961 Census evaluation program. While they represent the results of careful
scientific investigation, there were a number of gaps in the information available
and thus no single estimate or group of estimates can be viewed on its own. Rather,
the estimates of error as presented here are properly treated as broadly representative
of patterns and relationships".²

It  should  be  emphasized  that  theirs  are  broad  general  estimates,  that  for  some  types  of 
data  the  error  is  small;  for  some  it  is  larger.  The  original  article  brings  more  detail 
to  this  point. 

It should further be emphasized that these estimates are underestimates in that they
do not include processing errors: errors in coding, editing, and imputation. Finally,
the estimates to be quoted from their article do not include 'time lag' error.

We  reproduce  two  columns  from  their  Table  1: 


2.  IBID,  p.  96 




RMSE ESTIMATES WITHOUT TIME FACTOR³

                                  100 %               33 1/3 %
Area of          Cell
population N     frequency     RMSE    CV %        RMSE    CV %

100                    5         2.2     44          3.8     76
                      25         4.6     18          7.6     31
                      50         5.8     12          9.2     18

200                    5         2.2     45          3.8     77
                      25         4.9     20          8.2     33
                      50         6.5     14         11.0     22
                     100         9.3      9         13.6     14

500                    5         2.2     45          3.?     77
                      25         5.1     20          8.6     34
                      50         7.3     15         12.0     24
                     100        10.8     11         16.6     17

1,000                  5         2.2     45          3.9     77
                      25         5.2     21          8.7     35
                      50         7.5     15         12.3     25
                     100        11.2     11         17.5     17
                     500        34.0      7         40.6      8

5,000                  5         2.2     45          3.9     78
                      25         5.2     21          8.8     35
                      50         7.7     15         12.5     25
                     100        11.6     12         18.2     18
                     500        36.7      7         47.4      9
                   1,000        66.3      7         77.5      8

The authors state:

"The Census is the only comprehensive source of small area and detailed
cross-tabulated data. For this reason, our discussion will focus on small numbers.
In this context, the most noteworthy aspect of the data provided in Table 1 is
that small numbers, less than 5 or 10, are of limited reliability".⁴

3. IBID, p. 104


4.  IBID,  p.  10? 




As an example, we translate their coefficients of variation into
confidence limits for a cell frequency of 5 from a 100 per cent count. (We use
the figures for N = 100; they are almost the same for all N.) The 95 per cent
confidence limits are ±4.4; in other words, given complete count data, a published
figure of 5 represents, with 95 per cent certainty, a number that lies somewhere
between 0.6 and 9.4.

A  person  properly  using  a  census  tabulation  would  interpret  the  figure 
of  5  in  a  table  as  indicating  that  the  true  value  for  the  cell  lies  somewhere  between 
1  and  9.  He  would  know  just  as  much  if  the  numbers  in  the  table  had  been  rounded 
to  base  5;  when  he  sees  a  figure  of  5,  he  knows  that  it  represents  a  real  figure 
ranging between 1 and 9.

If he were dealing with data collected on a 1/3 sample, the uncertainty
introduced by rounding is clearly less than that inherent in the figure. The 95
per cent confidence interval for a figure of 5 with a 1/3 sample is 0 to 12.6
with N = 100.

We can compare the confidence intervals for various cell frequencies
with the range of uncertainty introduced by rounding (again choosing N = 100 and
noting that the coefficients of variation are roughly the same for all N).


Cell             95 % confidence range            Rounding
frequency     100 % count       1/3 sample        range

 5             0.6 -  9.4        0   - 12.6        1 -  9
25            16.0 - 34.0       14.5 - 40.5       21 - 29
50            38.0 - 62.0       32.0 - 68.0       46 - 54
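
The 100 per cent count column and the rounding range can be recomputed from the coefficients of variation quoted for N = 100. The Python sketch below is an illustration only; the function names are hypothetical, and the interval is taken as the estimate plus or minus twice its RMSE, as in the quoted definition.

```python
# CV (%) for a 100 per cent count with N = 100, taken from the RMSE table.
cv_percent = {5: 44, 25: 18, 50: 12}


def confidence_range(frequency, cv):
    """Estimate plus or minus twice its RMSE, with RMSE = frequency * cv / 100."""
    half_width = 2 * frequency * cv / 100
    low = max(0.0, frequency - half_width)   # a count cannot be negative
    return round(low, 1), round(frequency + half_width, 1)


def rounding_range(published, base=5):
    """True values that could have produced a published multiple of base."""
    return max(0, published - (base - 1)), published + (base - 1)


for frequency, cv in cv_percent.items():
    print(frequency, confidence_range(frequency, cv), rounding_range(frequency))
```

For each of the three cell frequencies the rounding range falls inside the 95 per cent confidence range.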

These  figures  are  striking  even  considering  the  limitations  of  these 
RMSE estimates. In no case does the rounding make the figure less accurate. The
rounded numbers always contain as much information as the same number in an
unrounded  table.  A  user  seeing  a  5  in  a  sample  table  would  assume  that  the  true 
value  for  that  cell  lay  between  0  and  13  whether  the  5  appeared  in  either  a  rounded 
or  in  an  unrounded  table.  One  of  the  advantages  of  the  rounded  table  is  that  it 
will  force  the  user  to  think  in  terms  of  ranges,  and  these  ranges  will  be  in  almost 
all  cases  less  than  the  confidence  intervals  that  the  users  should  be  working  with 
in  any  case. 

To summarize we could quote again from the Beynon, Ostry, Platek paper: "What is
clear is that small numbers will contain considerable error, whether derived from
a full count or sample".⁵ And one can further conclude that small numbers will
contain considerable error whether rounded or not. To paraphrase the paper's
conclusion, rounding would only add marginally to the unreliability of already
unreliable data.

Summary

To  summarize  the  advantages  of  random  rounding: 

1.  It  gives  strong  protection  against  direct,  residual,  and  negative  disclosure 
with  one  technique; 

2.  It  is  easy  to  describe  and  easy  to  implement; 

3.  It  has  a  minimal  impact  upon  the  data; 

4.  It  permits  aggregation  of  the  published  figures; 

5. Its impact upon each cell of a table is inversely related to the size of that
cell; the larger the figure the smaller the per cent change; no number will
be changed by more than 4.


5.  IBID,  p.  107 




